Acoustic interval detection method and device

ABSTRACT

There is provided a harmonic structure acoustic signal detection device that does not depend on level fluctuations of the input signal and has excellent real-time performance and noise resistance. The device includes: an FFT unit (200) which performs FFT on an input signal and calculates a power spectrum component for each frame; a harmonic structure extraction unit (201) which leaves only a harmonic structure from the power spectrum component; a voiced feature evaluation unit (210) which evaluates correlation between the frames of harmonic structures extracted by the harmonic structure extraction unit (201), thereby evaluates whether or not the segment is a vowel segment, and extracts the voiced segment; and a speech segment determination unit (205) which determines a speech segment according to the continuity and durability of the output of the voiced feature evaluation unit (210).

TECHNICAL FIELD

The present invention relates to a harmonic structure signal and harmonic structure acoustic signal detection method of detecting, from an input acoustic signal, a signal having a harmonic structure and a start and end point of a segment including speech in particular as a speech segment, and particularly to a harmonic structure signal and harmonic structure acoustic signal detection method used under the environmental noise situation.

BACKGROUND ART

Human voice is produced by vibration of vocal folds and resonance of phonatory organs. It is known that a human being produces various sounds in order to change the loudness and pitch of his voice by controlling his vocal folds to change the frequency of their vibration or by changing the positions of his phonatory organs such as a nose and a tongue, namely by changing the shape of his vocal tract. It is also known that, when considering the voice produced as such as an acoustic signal, the feature of such an acoustic signal is that it contains spectral envelope components which change gradually according to the frequencies and spectral fine structure components which change periodically in a short time (for the case of voiced vowels and the like) or which change aperiodically (for the case of consonants and unvoiced vowels). The former spectral envelope components represent the resonance features of the phonatory organs, and are used as features indicating the shapes of a human throat and mouth, for example, as features for speech recognition. On the other hand, the latter spectral fine structure components represent the periodicity of the sound source, and are used as features indicating the fundamental periods of vocal folds, namely the voice pitches. The spectrum of a speech signal is expressed by the product of these two elements. A signal which contains the latter component which clearly indicates the fundamental period and the harmonic component thereof, particularly in a vowel part or the like, is also called a harmonic structure.

Conventionally, various methods for detecting a speech segment from an input acoustic signal have been suggested. They are roughly classified into the following: a method for identifying a speech segment using amplitude information such as frequency band power and spectral envelope indicating the rough shape of the spectrum of an input acoustic signal (hereinafter referred to as “method 1”); a method for detecting the opening and closing of a mouth in a video by analyzing it (“method 2”); a method for detecting a speech segment by comparing an acoustic model which represents speech and noise with the feature of an input acoustic signal (“method 3”); and a method for determining a speech segment by focusing attention on a speech spectral envelope shape determined by the shape of a vocal tract and a harmonic structure which is created by the vibration of vocal folds, which are both the features of articulatory organs (“method 4”).

However, the method 1 has an inherent problem that it is difficult to distinguish between speech and noise based on amplitude information only. So, in the method 1, a speech segment and a noise segment are assumed and the speech segment is detected by relearning a threshold value determined in order to distinguish between the speech segment and the noise segment. Therefore, when the amplitude of the noise segment becomes large relative to the amplitude of the speech segment (namely, when the speech signal-to-noise ratio (hereinafter referred to as “SNR”) becomes low) during the process of learning, the accuracy of the assumption itself of the noise segment and the speech segment has an influence on the performance, which reduces the accuracy of the threshold learning. As a result, there occurs a problem that the performance of speech segment detection is degraded.

In the method 2, it is possible to maintain the detection/estimation accuracy of a speech segment constant regardless of the SNR if the opening of a mouth during the speech segment is detected, for example, not using sound input but only using an image. However, there are problems that the image analyzing processing costs more than the speech signal analyzing processing, and a speech segment cannot be detected if a mouth does not face toward a camera.

In the method 3, it is difficult to assume noise in itself while the performance under the assumed environmental noise is ensured, so this method is available in the limited environment only. Although this method suggests a technique to learn the noise environment on the site, such technique has a problem that the performance is degraded depending on the accuracy of the learning method, as is the case with the method using amplitude information (i.e., the method 1).

On the other hand, the method 4 has been suggested, in which a speech segment is detected by focusing attention on the spectral envelope shape determined by the vocal tract shape as well as the harmonic structure created by the vibration of vocal folds, which are the features of articulatory organs.

The method using the spectral envelope shape includes a method for evaluating the continuity of band power, for example, cepstra. In this method, the performance is degraded because it is hard to distinguish noise offset components under the lowered SNR situation.

A pitch detection method is one of the methods focusing attention on the harmonic structure, and various other methods have been suggested, such as a method for extracting auto-correlation and higher quefrency part in the time domain and a method for creating auto-correlation in the frequency domain. However, these methods have problems: for example, it is difficult to extract a speech segment if a current signal does not have a single pitch (harmonic fundamental frequency), and an extraction error is likely to occur due to environmental noise.

Additionally, there is a well-known technique of accentuating, suppressing, or separating and extracting an acoustic signal having a harmonic structure such as a human voice and a specific musical instrument, from an acoustic signal consisting of a mixture of several kinds of acoustic signals. For example, the following methods have been suggested: for speech signals, a noise reduction device which reduces only noise in an acoustic signal consisting of a mixture of noise signals and speech signals (See, for example, Japanese Laid-Open Patent Application No. 09-153769 Publication); and for music signals, a method for separating and removing a melody included in played music signal (See, for example, Japanese Laid-Open Patent Application No. 11-143460 Publication).

However, according to the method described in Japanese Laid-Open Patent Application No. 09-153769 Publication, speech and non-speech are detected by observing a linear predictive residual signal in each frequency band of an input signal. Therefore, this method has a problem that the performance is degraded under the non-stationary noise condition with the lower SNR in which the linear prediction does not work well.

The method described in Japanese Laid-Open Patent Application No. 11-143460 Publication is a method using the feature specific to melodies in music that a sound of the same pitch continues for a predetermined period of time. Therefore, there is a problem that it is difficult to use this method as it is for separation between speech and noise. In addition, a large amount of processing required for this method becomes a problem if it does not aim to separate or remove acoustic components.

A method using the acoustic feature itself which represents a harmonic structure as an evaluation function has also been suggested (See, for example, Japanese Laid-Open Patent Application No. 2001-222289 Publication). FIG. 32 is a block diagram showing an outline structure of a speech segment determination device which uses the method suggested in Japanese Laid-Open Patent Application No. 2001-222289 Publication.

A speech segment detection device shown in FIG. 32 is a device which determines a speech segment in an input signal, and includes a fast Fourier transform (FFT) unit 100, a harmonic structure evaluation unit 101, a harmonic structure peak detection unit 102, a pitch candidate detection unit 103, an inter-frame amplitude difference harmonic structure evaluation unit 104 and a speech segment determination unit 105.

The FFT unit 100 performs FFT processing on an input signal for each frame (for example, one frame is 10 msec) so as to perform frequency transform on the input signal, and carries out various analyses thereof. The harmonic structure evaluation unit 101 evaluates whether or not each frame has a harmonic structure based on the frequency analysis result obtained from the FFT unit 100. The harmonic structure peak detection unit 102 converts the harmonic structure extracted by the harmonic structure evaluation unit 101 into the local peak shape, and detects the local peak.

The pitch candidate detection unit 103 detects a pitch by tracking the local peaks detected by the harmonic structure peak detection unit 102 in the time axis direction (frame direction). A pitch denotes the fundamental frequency of a harmonic structure.

The inter-frame amplitude difference harmonic structure evaluation unit 104 calculates the value of the inter-frame difference of the amplitudes obtained as a result of the frequency analysis by the FFT unit 100, and evaluates whether or not the current frame has a harmonic structure based on the difference value.

The speech segment determination unit 105 makes a comprehensive judgment of the pitch detected by the pitch candidate detection unit 103 and the evaluation result by the inter-frame amplitude difference harmonic structure evaluation unit 104 so as to determine the speech segment.

According to the speech segment detection device 10 shown in FIG. 32, it becomes possible to determine a speech segment not only in an acoustic signal having a single pitch but also in an acoustic signal having a plurality of pitches.

However, when the pitch candidate detection unit 103 tracks local peaks, appearance and disappearance of such local peaks have to be considered, and it is difficult to detect the pitch with high accuracy considering such appearance and disappearance.

In view of the fact that a peak that is a local maximum value is handled, much resistance to noise cannot be expected. In addition, the inter-frame amplitude difference harmonic structure evaluation unit 104 evaluates whether or not the difference between frames has a harmonic structure in order to evaluate temporal fluctuations. However, since it just uses the difference of amplitudes, it has a problem that not only the information of the harmonic structure is lost but also the acoustic feature itself of a sudden noise is evaluated as a difference value if such a sudden noise occurs.

Against this backdrop, the present invention has been conceived in order to solve the above-mentioned problems, and it is an object of the present invention to provide a harmonic structure acoustic signal detection method and device which allow highly accurate detection of a speech segment, not depending on the level fluctuations of an input signal.

It is another object thereof to provide a harmonic structure acoustic signal detection method and device with outstanding real-time features.

DISCLOSURE OF INVENTION

A harmonic structure acoustic signal detection method in an aspect of the present invention is a method of detecting, from an input acoustic signal, a segment that includes a signal having a harmonic structure, particularly speech, as a speech segment, the method including: an acoustic feature extraction step of extracting an acoustic feature in each of frames into which the input acoustic signal is divided at every predetermined time period; and a segment determination step of evaluating continuity of the acoustic features and of determining a speech segment according to the evaluated continuity.

As described above, a speech segment is determined by evaluating the continuity of acoustic features. Unlike the conventional method of tracking local peaks, there is no need to consider the fluctuations of the input acoustic signal level resulting from appearance and disappearance of local peaks, and therefore a speech segment can be determined with accuracy.

It is preferable that in the acoustic feature extraction step, frequency transform is performed on each frame of the input acoustic signal, and a harmonic structure is accentuated based on each component obtained through the frequency transform and the acoustic feature is extracted.

A harmonic structure is seen in speech (particularly in a vowel sound). Therefore, by determining a speech segment using the acoustic feature in which the harmonic structure is accentuated, the speech segment can be determined with higher accuracy.

It is further preferable that in the acoustic feature extraction step, a harmonic structure is further extracted from each component obtained through the frequency transform, and an acoustic feature is obtained through a component that consists of a predetermined frequency band that includes the extracted harmonic structure.

By determining a speech segment using the acoustic feature of the frame including only the frequency bands in which harmonic structures are clearly maintained, the speech segment can be determined with higher accuracy.

It is further preferable that in the segment determination step, continuity of the acoustic features is evaluated based on a correlation value between the acoustic features of frames.

As described above, the continuity of harmonic structures is evaluated based on the correlation value between the acoustic features of frames. Therefore, compared with the conventional method of evaluating the continuity of harmonic structures based on the amplitude difference between frames, better evaluation can be made using more information of the harmonic structures. As a result, even in the case where a sudden noise over a short period of frames occurs, such a sudden noise is not detected as a speech segment, and thus a speech segment can be detected with accuracy.

It is further preferable that the segment determination step includes: an evaluation step of calculating an evaluation value for evaluating the continuity of the acoustic features; and a speech segment determination step of evaluating temporal continuity of the evaluation values and of determining a speech segment according to the evaluated temporal continuity.

As described in the embodiment, the processing in the speech segment determination step corresponds to the processing for concatenating temporally adjoining voiced segments (voiced segments obtained based only on the evaluation values) so as to detect a speech segment precisely. A speech segment determined by concatenating the temporally adjoining voiced segments may therefore include a consonant portion, which has a smaller evaluation value for harmonic structure than a vowel portion.

It is further possible to determine whether a segment having a harmonic structure is speech or non-speech such as music by evaluating the segment in detail. As for the frames judged to have a harmonic structure, by evaluating the continuity of the number indices of the frequency bands in which the maximum or minimum value for harmonic structure is detected, it is possible to assess whether the segment is speech or music.

As for the segment which is considered to have a harmonic structure using the continuity of the evaluation values for the harmonic structures, it is possible to judge, using its distribution of the evaluation values, whether such a segment is a transmutation from the speech or music segments having continuous harmonic structures, or a sudden noise having a harmonic structure.

As for the segments other than the segments having the above-mentioned features of harmonic structures, it is possible to judge them to be either segments regarded as silence because the input signal is weak, or non-harmonic structure segments having no harmonic structure.

As shown in the fifth embodiment, the present invention discloses a method for judging if each frame has a harmonic structure while receiving a sound signal.

It is further preferable that the segment determination step further includes: a step of estimating a speech signal-to-noise ratio of the input acoustic signal based on comparisons, for a predetermined number of frames, between (i) acoustic features extracted in the acoustic feature extraction step or the evaluation values calculated in the evaluation step and (ii) a first predetermined threshold; and a step of determining the speech segment based on the evaluation value calculated in the evaluation step, in the case where the estimated speech signal-to-noise ratio is equal to or higher than a second predetermined threshold, and in the speech segment determination step, the temporal continuity of the evaluation values is evaluated and the speech segment is determined according to the evaluated temporal continuity, in the case where the speech signal-to-noise ratio is lower than the second predetermined threshold.

Accordingly, in the case where the estimated speech signal-to-noise ratio of an input acoustic signal is high, it is possible to omit evaluating the temporal continuity of the evaluation values for evaluating the continuity of acoustic features for determining the speech segment. Therefore, the speech segment can be detected with outstanding real-time features.

Note that the present invention can be embodied not only as the above-mentioned harmonic structure acoustic signal segment detection method but also as a harmonic structure acoustic signal segment detection device including, as units, the steps included in that method, and as a program causing a computer to execute each of the steps of the harmonic structure acoustic signal detection method. It is needless to say that the program can be distributed via a storage medium such as CD-ROM and a transmission medium such as the Internet.

As described above, according to the harmonic structure acoustic signal detection method and device, it becomes possible to separate speech segments from noise segments accurately. It is possible to improve the speech recognition level particularly by applying the present invention as a pre-process for the speech recognition method, and therefore the practical value of the present invention is extremely high. It is also possible to use memory capacity efficiently, for example by recording only speech segments, by applying the present invention to an integrated circuit (IC) recorder or the like.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram showing a hardware structure of a speech segment detection device according to a first embodiment of the present invention.

FIG. 2 is a flowchart of processing performed by the speech segment detection device according to the first embodiment.

FIG. 3 is a flowchart of harmonic structure extraction processing by a harmonic structure extraction unit.

FIG. 4(a) to (f) is a diagram schematically showing processes of extracting spectral components which contain only harmonic structures from spectral components of each frame.

FIG. 5(a) to (f) is a diagram showing transition of input signal transform according to the present invention.

FIG. 6 is a flowchart of speech segment determination processing.

FIG. 7 is a block diagram showing a hardware structure of a speech segment detection device according to a second embodiment of the present invention.

FIG. 8 is a flowchart of processing performed by the speech segment detection device according to the second embodiment.

FIG. 9 is a block diagram showing a hardware structure of a speech segment detection device according to a third embodiment.

FIG. 10 is a flowchart of processing performed by the speech segment detection device.

FIG. 11 is a diagram for explaining harmonic structure extraction processing.

FIG. 12 is a flowchart showing the details of the harmonic structure extraction processing.

FIG. 13(a) is a diagram showing power spectra of an input signal. FIG. 13(b) is a diagram showing harmonic structure values R(i). FIG. 13(c) is a diagram showing band numbers N(i). FIG. 13(d) is a diagram showing weighted band numbers Ne(i). FIG. 13(e) is a diagram showing corrected harmonic structure values R′(i).

FIG. 14(a) is a diagram showing power spectra of an input signal. FIG. 14(b) is a diagram showing harmonic structure values R(i). FIG. 14(c) is a diagram showing band numbers N(i). FIG. 14(d) is a diagram showing weighted band numbers Ne(i). FIG. 14(e) is a diagram showing corrected harmonic structure values R′(i).

FIG. 15(a) is a diagram showing power spectra of an input signal. FIG. 15(b) is a diagram showing harmonic structure values R(i). FIG. 15(c) is a diagram showing band numbers N(i). FIG. 15(d) is a diagram showing weighted band numbers Ne(i). FIG. 15(e) is a diagram showing corrected harmonic structure values R′(i).

FIG. 16(a) is a diagram showing power spectra of an input signal. FIG. 16(b) is a diagram showing harmonic structure values R(i). FIG. 16(c) is a diagram showing band numbers N(i). FIG. 16(d) is a diagram showing weighted band numbers Ne(i). FIG. 16(e) is a diagram showing corrected harmonic structure values R′(i).

FIG. 17 is a detailed flowchart of speech/music segment determination processing.

FIG. 18 is a block diagram showing a hardware structure of a speech segment detection device according to a fourth embodiment.

FIG. 19 is a flowchart of processing performed by the speech segment detection device.

FIG. 20 is a flowchart showing the details of harmonic structure extraction processing.

FIG. 21 is a flowchart showing the details of speech segment determination processing.

FIG. 22(a) is a diagram showing power spectra of an input signal. FIG. 22(b) is a diagram showing harmonic structure values R(i). FIG. 22(c) is a diagram showing weighted distributions Ve(i). FIG. 22(d) is a diagram showing speech segments before being concatenated. FIG. 22(e) is a diagram showing speech segments after being concatenated.

FIG. 23(a) is a diagram showing power spectra of an input signal. FIG. 23(b) is a diagram showing harmonic structure values R(i). FIG. 23(c) is a diagram showing weighted distributions Ve(i). FIG. 23(d) is a diagram showing speech segments before being concatenated. FIG. 23(e) is a diagram showing speech segments after being concatenated.

FIG. 24 is a flowchart showing another example of the harmonic structure extraction processing.

FIG. 25(a) is a diagram showing an input signal. FIG. 25(b) is a diagram showing power spectra of the input signal. FIG. 25(c) is a diagram showing harmonic structure values R(i). FIG. 25(d) is a diagram showing weighted harmonic structure values Re(i). FIG. 25(e) is a diagram showing corrected harmonic structure values R′(i).

FIG. 26(a) is a diagram showing an input signal. FIG. 26(b) is a diagram showing power spectra of the input signal. FIG. 26(c) is a diagram showing harmonic structure values R(i). FIG. 26(d) is a diagram showing weighted harmonic structure values Re(i). FIG. 26(e) is a diagram showing corrected harmonic structure values R′(i).

FIG. 27 is a block diagram showing a structure of a speech segment detection device according to a fifth embodiment.

FIG. 28 is a flowchart of processing performed by the speech segment detection device.

FIG. 29(a) to (d) is a diagram for explaining concatenation of harmonic structure segments.

FIG. 30 is a detailed flowchart of harmonic structure frame provisional judgment processing.

FIG. 31 is a detailed flowchart of harmonic structure segment final determination processing.

FIG. 32 is a diagram showing a rough hardware structure of a conventional speech segment determination device.

BEST MODE FOR CARRYING OUT THE INVENTION

First Embodiment

A description is given below, with reference to the drawings, of a speech segment detection device according to the first embodiment of the present invention. FIG. 1 is a block diagram showing a hardware structure of a speech segment detection device 20 according to the first embodiment.

The speech segment detection device 20 is a device which determines, in an input acoustic signal (hereinafter referred to just as an “input signal”), a speech segment that is a segment during which a man is vocalizing (uttering speech sounds). The speech segment detection device 20 includes an FFT unit 200, a harmonic structure extraction unit 201, a voiced feature evaluation unit 210, and a speech segment determination unit 205.

The FFT unit 200 performs FFT on the input signal so as to obtain power spectral components of each frame. The time of each frame shall be 10 msec here, but the present invention is not limited to this time.

The harmonic structure extraction unit 201 removes noise components and the like from the power spectral components extracted by the FFT unit 200, and extracts power spectral components having only the harmonic structures.

The voiced feature evaluation unit 210 is a device which evaluates the inter-frame correlation of the power spectral components having only the harmonic structures extracted by the harmonic structure extraction unit 201 so as to evaluate whether each frame is a vowel segment or not and extract a voiced segment. The voiced feature evaluation unit 210 includes a feature storage unit 202, an inter-frame feature correlation value calculation unit 203 and a difference processing unit 204. Note that the harmonic structure is a property which is often seen in the power spectral distribution in a vowel phonation segment. No such harmonic structures as seen in the vowel phonation segment are seen in the power spectral distribution in a consonant phonation segment.

The feature storage unit 202 stores the power spectra of a predetermined number of frames outputted from the harmonic structure extraction unit 201. The inter-frame feature correlation value calculation unit 203 calculates the correlation value between the power spectrum outputted from the harmonic structure extraction unit 201 and the power spectrum of a frame which precedes the current frame by a predetermined number of frames and is stored in the feature storage unit 202. The difference processing unit 204 calculates the average value of the correlation values calculated by the inter-frame feature correlation value calculation unit 203 for a predetermined period of time, subtracts the average value from the respective correlation values outputted from the inter-frame feature correlation value calculation unit 203, and obtains the corrected correlation values based on the average of the differences between the correlation values and the average value.
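The mean-subtraction performed by the difference processing unit 204 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `corrected_correlation` and the `period` parameter are hypothetical, and the averaging period is assumed to be either the whole analysed block or its first `period` frames, since the description says only "a predetermined period of time".

```python
import numpy as np

def corrected_correlation(e1, period=None):
    """Subtract the average correlation value from each correlation value.

    e1     : 1-D sequence of per-frame correlation values (NaN = undefined).
    period : assumed averaging length in frames; None means the whole block.
    """
    e1 = np.asarray(e1, dtype=float)
    reference = e1 if period is None else e1[:period]
    # nanmean ignores frames whose correlation value is not yet defined.
    return e1 - np.nanmean(reference)
```

Subtracting the average removes a slowly varying offset, so that a later threshold comparison reacts to how far a frame deviates from the recent level and not to the absolute level.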

The speech segment determination unit 205 determines the speech segment based on the corrected correlation value obtained from the average difference outputted from the difference processing unit 204.

A description is given below of the operation of the speech segment detection device 20 structured as above. FIG. 2 is a flowchart of the processing performed by the speech segment detection device 20.

The FFT unit 200 performs FFT on an input signal so as to obtain the power spectral components thereof as the acoustic features used for extracting the harmonic structures (S2). More specifically, the FFT unit 200 performs sampling on the input signal at a predetermined sampling frequency Fs (for example, 11.025 kHz) to obtain FFT spectral components at a predetermined number of points (for example, 128 points) per frame (for example, 10 msec). The FFT unit 200 obtains the power spectral components by converting the spectral components obtained at respective points into logarithms. Hereinafter, a power spectral component is referred to just as a spectral component, if necessary.
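Step S2 can be sketched as follows, using the example values given above (Fs = 11.025 kHz, 10 msec frames, 128 points). The details not stated in the description are assumptions made for illustration: the function name `log_power_spectra`, the Hann window, zero-padding each frame to a 256-point FFT and keeping the first 128 bins, and the small constant added before taking the logarithm.

```python
import numpy as np

def log_power_spectra(signal, fs=11025.0, frame_ms=10.0, n_points=128):
    """Return an array of shape (n_frames, n_points) of log power spectra."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000.0)      # 110 samples at 11.025 kHz
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)               # assumption: Hann window
    # Assumption: zero-pad to 2*n_points and keep the first n_points bins.
    spectrum = np.fft.rfft(frames * window, n=2 * n_points, axis=1)[:, :n_points]
    power = np.abs(spectrum) ** 2
    # Logarithmic conversion; the tiny offset avoids log(0) for silent frames.
    return 10.0 * np.log10(power + 1e-12)
```

For a one-second input at 11.025 kHz this yields 100 frames of 128 components each, which matches the 128-point vectors P(i) used in the correlation calculation.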

Next, the harmonic structure extraction unit 201 removes noise components and the like from the power spectral components extracted by the FFT unit 200 so as to extract the power spectral components having only the harmonic structures (S4).

The power spectral components calculated by the FFT unit 200 contain the noise offset and the spectral envelope shapes created by the vocal tract shape, and thus cause time jitter. Therefore, the harmonic structure extraction unit 201 removes these components and extracts the power spectral components having only the harmonic structures which are produced by vocal fold vibration. As a result, a voiced segment is detected more effectively.

A detailed description is given, with reference to FIG. 3 and FIG. 4, of the processing by the harmonic structure extraction unit 201 (S4). FIG. 3 is a flowchart of the harmonic structure extraction processing by the harmonic structure extraction unit 201, and FIG. 4 is a diagram schematically showing the processes of extracting spectral components which have only harmonic structures from spectral components of each frame.

As shown in FIG. 4(a), the harmonic structure extraction unit 201 calculates the maximum peak-hold value Hmax(f) from the spectral components S(f) of each frame (S22), and calculates the minimum peak-hold value Hmin(f) (S24).

As shown in FIG. 4(b), the harmonic structure extraction unit 201 removes the floor components included in the spectral components S(f) by subtracting the minimum peak-hold value Hmin(f) from the respective spectral components S(f) (S26). As a result, fluctuating components resulting from noise offset components and spectral envelope components are removed.

As shown in FIG. 4(c), the harmonic structure extraction unit 201 calculates the difference value between the maximum peak-hold value Hmax(f) and the minimum peak-hold value Hmin(f) so as to calculate the peak fluctuation (S28).

As shown in FIG. 4(d), the harmonic structure extraction unit 201 differentiates the amount of peak fluctuation in the frequency direction so as to calculate the amount of change in the peak fluctuation (S30). This calculation is made for the purpose of detecting the harmonic structures based on the assumption that the change in peak fluctuation is small.

As shown in FIG. 4(e), the harmonic structure extraction unit 201 calculates the weight W(f) which realizes the above assumption (S32). In other words, the harmonic structure extraction unit 201 compares the absolute value of the amount of change in the peak fluctuation with a predetermined threshold value θ, and sets the weight W(f) to 1 when the absolute value of the change is smaller than the threshold value θ, while setting the weight W(f) to the inverse of the absolute value of the change when it is equal to or larger than the threshold value θ. As a result, it becomes possible to assign a lighter weight to the part in which the change in the amount of peak fluctuation is larger, and a heavier weight to the part in which the change is smaller.
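Written as a formula, the weighting rule of step S32 reads as follows. The symbol D(f) is introduced here only for illustration and denotes the amount of change in the peak fluctuation obtained in step S30; it does not appear in the original description.

```latex
D(f) = \frac{d}{df}\bigl(H_{\max}(f) - H_{\min}(f)\bigr),
\qquad
W(f) =
\begin{cases}
1, & |D(f)| < \theta \\[4pt]
\dfrac{1}{|D(f)|}, & |D(f)| \ge \theta
\end{cases}
```

Note that W(f) is continuous at |D(f)| = θ only when θ = 1. For θ < 1 the weight would exceed 1 just above the threshold, so a threshold of at least 1 appears to be the natural choice, although the description does not state a value.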

As shown in FIG. 4(f), the harmonic structure extraction unit 201 multiplies the spectral components with the floor components being removed (S(f)−Hmin(f)) by the weight W(f) so as to obtain the spectral components S′(f) (S34). This processing allows elimination of non-harmonic structure components in which the change in peak fluctuation is large.
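Steps S22 to S34 can be sketched for a single frame as follows. The description does not define how the peak-hold values are computed, so this sketch assumes, purely for illustration, that Hmax(f) and Hmin(f) are the running maximum and minimum over a small window of neighbouring frequency bins. The function name `extract_harmonic_structure`, the `half_width` parameter, and the default threshold are likewise hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_harmonic_structure(S, half_width=4, theta=1.0):
    """Return S'(f) for one frame of log power spectrum S(f)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    S = np.asarray(S, dtype=float)
    # Assumption: peak-hold = max/min over 2*half_width+1 neighbouring bins.
    padded = np.pad(S, half_width, mode="edge")
    windows = sliding_window_view(padded, 2 * half_width + 1)
    h_max = windows.max(axis=1)               # S22: maximum peak-hold Hmax(f)
    h_min = windows.min(axis=1)               # S24: minimum peak-hold Hmin(f)
    floor_removed = S - h_min                 # S26: remove floor components
    fluctuation = h_max - h_min               # S28: peak fluctuation
    change = np.gradient(fluctuation)         # S30: derivative along frequency
    abs_change = np.abs(change)
    weight = np.ones_like(S)                  # S32: W(f) = 1 below threshold
    large = abs_change >= theta
    weight[large] = 1.0 / abs_change[large]   #      W(f) = 1/|change| otherwise
    return floor_removed * weight             # S34: S'(f)
```

With a regularly alternating spectrum the peak fluctuation is constant, so every weight is 1 and only the floor is removed. A flat spectrum has no peaks above its floor and maps to all zeros.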

The description now returns to the operation of the speech segment detection device 20 shown in FIG. 2. After the harmonic structure extraction processing (S4 in FIG. 2 and FIG. 3), the inter-frame feature correlation value calculation unit 203 calculates the correlation value between the spectral components outputted from the harmonic structure extraction unit 201 and the spectral components of a frame which precedes the current frame by a predetermined number of frames and is stored in the feature storage unit 202 (S6).

A description is given here of a method for calculating a correlation value E1(j) using spectral components of adjacent frames, assuming that the current frame is the jth frame. The correlation value E1(j) is calculated according to the following equations (1) to (5). More specifically, power spectral components P(i) and P(i-1) at 128 points of a frame i and a frame i-1 shall be represented by the following equations (1) and (2). The value of a correlation function xcorr(P(i-1), P(i)) of the power spectral components P(i) and P(i-1) shall be represented by the following equation (3). In other words, the value of the correlation function xcorr(P(i-1), P(i)) is the vector quantity consisting of the products of the respective points. z1(i), namely the maximum value of the vector elements of xcorr(P(i-1), P(i)), is calculated as shown in the following equation (4). This value may be used as the correlation value E1(j) of the frame j, or, for example, the value obtained by adding the maximum values of three frames may be used, as shown in the following equation (5).

$$P(i) = (p_1(i), p_2(i), \ldots, p_{128}(i)) \tag{1}$$

$$P(i-1) = (p_1(i-1), p_2(i-1), \ldots, p_{128}(i-1)) \tag{2}$$

$$\mathrm{xcorr}(P(i-1), P(i)) = (p_1(i-1) \times p_1(i),\ p_2(i-1) \times p_2(i),\ \ldots,\ p_{128}(i-1) \times p_{128}(i)) \tag{3}$$

$$z1(i) = \max(\mathrm{xcorr}(P(i-1), P(i))) \tag{4}$$

$$E1(j) = \sum_{i=j-2}^{j} z1(i) \tag{5}$$
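Equations (3) to (5) can be illustrated with a short sketch. The function names and the `span` parameter are hypothetical; `frames` is assumed to be a list of per-frame power spectra of equal length.

```python
def z1(prev_frame, cur_frame):
    """Equations (3)-(4): maximum of the point-wise products of two frames."""
    return max(a * b for a, b in zip(prev_frame, cur_frame))


def e1(frames, j, span=3):
    """Equation (5): sum of z1(i) for i = j-span+1 .. j (span=3 gives j-2 .. j)."""
    first = j - span + 1
    if first < 1:
        raise ValueError("not enough preceding frames")
    return sum(z1(frames[i - 1], frames[i]) for i in range(first, j + 1))
```

For example, with four two-point frames `[[1, 2], [3, 1], [0, 4], [2, 2]]`, the values z1(1), z1(2) and z1(3) are 3, 4 and 8, so `e1(frames, 3)` is 15.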

One example of the correlation value E1(j) is described below using the graphs shown in FIG. 5. FIG. 5 shows graphs which represent signals obtained by processing an input signal. FIG. 5(a) shows a waveform of the input signal. This waveform is obtained in the case where a man utters “aaru ando bii hoteru higashi nihon” during a time period of about 1,200 to 3,000 msec in a vacuum cleaner noise (SNR=0.5 dB) environment. This input signal contains a sudden “click” sound which is made when the vacuum cleaner is turned on at the point of about 500 msec, and the sound level of the vacuum cleaner increases at the point of about 2,800 msec when the rotation speed of the motor is changed from low to high. FIG. 5(b) shows the power of the input signal after performing FFT on the input signal shown in FIG. 5(a), and FIG. 5(c) shows the transition of the correlation values obtained in the correlation value calculation processing (S6).

Here, the correlation value E1(j) is calculated based on the following findings. The correlation value of acoustic features between frames is obtained based on the fact that harmonic structures continue across temporally adjacent frames. Therefore, a voiced segment is detected based on the correlation of the harmonic structures between temporally close frames. Such temporal continuity of harmonic structures is often seen in vowel segments. Therefore, it is deemed that the correlation values are larger in vowel segments, while they are smaller in consonant segments. In other words, it is deemed that when the correlation values of power spectral components between frames are obtained by focusing attention on harmonic structures, such correlation values become smaller in aperiodic noise segments. As a result, voiced segments stand out in the signal and can be identified more easily.

It is said that the duration of a vowel segment is 50 to 150 msec (5 to 15 frames) at the normal speech speed, and it is therefore assumed that the value of a correlation coefficient between frames is large within that duration even if the frames are not adjacent to each other. If this assumption is correct, this correlation value is an evaluation function which is resistant to aperiodic noise. The correlation value E1(j) is calculated using the sum of the values of correlation functions over several frames because the effect of sudden noise has to be removed and because a vowel segment is found to have a duration of 50 to 150 msec as mentioned above. Therefore, as shown in FIG. 5(c), there is no reaction to the sudden sound which occurs in the vicinity of the 50th frame, and the correlation values remain small.

Next, the difference processing unit 204 calculates the average value of the correlation values calculated by the inter-frame feature correlation value calculation unit 203 over a predetermined time period, and subtracts the average value from the correlation value of each frame so as to obtain the correlation value corrected by the average difference (S8). This is because it is deemed that the effect of periodic noise which lasts for a long time can be removed by subtracting the average value from the correlation value. Here, the average value of the correlation values for five seconds or so is calculated, and FIG. 5(c) shows the average value as the solid line 502. More specifically, a segment in which the correlation values appear above the solid line 502 is a segment in which the correlation values corrected by the above-mentioned average difference are positive values.
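The average-difference correction (S8) might be sketched as below. The trailing window and the default of 500 frames are assumptions: the description gives only "five seconds or so", and 500 frames follows from the 10 msec frame length implied by "50 to 150 msec (5 to 15 frames)".

```python
def subtract_running_average(corr, window=500):
    """Subtract the mean of the most recent `window` correlation values
    (hypothetical trailing window) from each frame's correlation value."""
    corrected = []
    for j in range(len(corr)):
        start = max(0, j - window + 1)
        recent = corr[start:j + 1]
        corrected.append(corr[j] - sum(recent) / len(recent))
    return corrected
```

A frame whose corrected value is positive lies above the average line (solid line 502 in FIG. 5(c)).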

Next, the speech segment determination unit 205 determines the speech segment based on the correlation values corrected from the correlation values E1(j) by the difference processing unit 204 using the average difference, according to the following three segment correction methods: selection using correlation values; use of segment duration; and concatenation of segments taking consonant segments and choked sound segments into consideration (S10).

The speech segment determination processing by the speech segment determination unit 205 (S10 in FIG. 2) is described below in more detail. FIG. 6 is a flowchart showing the details of the speech segment determination processing per voice utterance.

First, judgment of a segment using a correlation value, that is, the first segment correction method, is described below. The speech segment determination unit 205 checks, for a current frame, whether or not the corrected correlation value calculated by the difference processing unit 204 is larger than a predetermined threshold value (S44). For example, in the case where the predetermined threshold value is 0, such checking is equivalent to checking whether the correlation value shown in FIG. 5(c) is larger than the average value of the correlation values (solid line 502).

When the corrected correlation value is larger than the threshold value (YES in S44), it is judged that the current frame is a speech frame (S46), and when the corrected correlation value is equal to or smaller than the predetermined threshold value (NO in S44), it is judged that the current frame is a non-speech frame (S48). The above-mentioned speech judgment processing (S44 to S48) is repeated for all the frames in which speech segments are to be detected (S42 to S50). As a result of the above-mentioned processing, the graph shown in FIG. 5(d) is obtained, and a segment in which speech frames continue is detected as a voiced segment.
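The frame-wise judgment (S42 to S50) and the grouping of consecutive speech frames into voiced segments can be sketched as follows. The function name and the representation of a segment as an inclusive `(start, end)` pair of frame indices are assumptions made for illustration.

```python
def voiced_segments(corrected, threshold=0.0):
    """Return inclusive (start, end) frame ranges whose corrected
    correlation value is larger than the threshold."""
    segments = []
    start = None
    for j, value in enumerate(corrected):
        if value > threshold:          # S44 YES: speech frame (S46)
            if start is None:
                start = j
        elif start is not None:        # S44 NO: non-speech frame (S48)
            segments.append((start, j - 1))
            start = None
    if start is not None:              # close a segment that runs to the end
        segments.append((start, len(corrected) - 1))
    return segments
```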

As described above, when the corrected correlation value is equal to or smaller than the threshold value, it is judged that the frame is a non-speech frame. However, the corrected correlation value expected in a detected segment varies depending on the effects of noise levels and various conditions of acoustic features. Therefore, it is also possible to determine, through prior experiments, an appropriate threshold value for distinguishing between a speech frame and a non-speech (noise) frame, and to use that value. By applying such a stricter selection criterion for a harmonic structure signal, it can be expected that a periodic noise having a shorter time period than the time length used for calculation of the average difference, for example, 500 ms or so, is distinguished as a non-speech frame.

Next, a method for concatenating adjacent voiced segments, namely the second segment correction method, is described below. The speech segment determination unit 205 checks whether the distance (that is, the number of frames) between a current voiced segment and another voiced segment adjacent to the current segment is less than a predetermined number of frames (S54). For example, the predetermined number of frames shall be 30 here. When the distance is less than 30 frames (YES in S54), the two adjacent voiced segments are concatenated (S56). The above-mentioned processing (S54 to S56) is performed for all the voiced segments (S52 to S58). As a result of the above-mentioned processing for concatenating voiced segments, the graph shown in FIG. 5(e) is obtained, which shows that voiced segments which are close to each other are concatenated.

Voiced segments are concatenated for the following reason. Harmonic structures hardly appear in a consonant segment, particularly in an unvoiced consonant segment such as a plosive (/k/, /c/, /t/ and /p/) or a fricative, so the correlation value of such a segment is small and the segment is hardly detected as a voiced segment. However, since there is a vowel near a consonant, a segment in which vowels continue is regarded as a voiced segment. Therefore, it becomes possible to regard the consonant segment as a voiced segment, too.
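The concatenation step (S52 to S58) can be sketched as follows, assuming that segments are inclusive `(start, end)` frame ranges sorted by time and that the "distance" is the number of frames lying strictly between two segments. Both the representation and the function name are hypothetical.

```python
def concatenate_segments(segments, max_gap=30):
    """Merge neighbouring voiced segments separated by fewer than
    max_gap frames (S54 to S56)."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] - 1 < max_gap:
            # Gap is short: extend the previous segment over the gap.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

With the example value of 30 frames, segments ending at frame 20 and starting at frame 40 (19 frames apart) are joined, whereas a gap of 39 frames is left as is.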

Finally, selection by segment duration, that is, the third segment correction method, is described below. The speech segment determination unit 205 checks whether or not the duration of a current voiced segment is longer than a predetermined time period (S62). For example, the predetermined time period shall be 50 msec. When the duration is longer than 50 msec (YES in S62), it is determined that the current voiced segment is a speech segment (S64), and when the duration is equal to or shorter than 50 msec (NO in S62), it is determined that the current voiced segment is a non-speech segment (S66). By performing the above-mentioned processing (S62 to S66) for all the voiced segments, speech segments are determined (S60 to S68). As a result of the above-mentioned processing, the graph shown in FIG. 5(f) is obtained, and a speech segment is detected around the 110th to 280th frames. This diagram shows that a voiced segment corresponding to a periodic noise which exists around the 325th frame in the graph of FIG. 5(e) is determined to be a non-speech segment. As described above, the processing for selecting voiced segments based on their durations makes it possible to remove periodic noise that has a short duration but a high correlation value.
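The duration-based selection (S60 to S68) can be sketched as below. The frame length of 10 msec is an assumption inferred from the statement that 50 to 150 msec corresponds to 5 to 15 frames; the function name and segment representation are hypothetical.

```python
def select_speech_segments(segments, min_duration_ms=50, frame_ms=10):
    """Keep only voiced segments (inclusive frame ranges) that last
    longer than min_duration_ms (S62 to S66)."""
    return [
        (start, end)
        for start, end in segments
        if (end - start + 1) * frame_ms > min_duration_ms
    ]
```

A segment of exactly five frames (50 msec) is rejected, because the test in S62 requires the duration to be longer than the predetermined time period.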

According to the present embodiment as described above, a voiced segment is determined by evaluating the inter-frame continuity of harmonic structure spectral components. Therefore, it is possible to determine speech segments more accurately than with the conventional method of tracking local peaks.

In particular, the continuity of harmonic structures is evaluated based on the inter-frame correlation values of spectral components. Therefore, it is possible to evaluate such continuity while retaining more information on the harmonic structures than the conventional method of evaluating the continuity of the harmonic structures based on the amplitude difference between frames. Therefore, even in the case where a sudden noise occurs over a short period of frames, such sudden noise is not detected as a voiced segment.

Furthermore, a speech segment is determined by concatenating temporally adjacent voiced segments. Therefore, it is possible to determine not only vowels but also consonants, which have more indistinct harmonic structures than vowels, to be speech segments. It also becomes possible to remove noise having periodicity by evaluating the duration of a voiced segment.

Second Embodiment

A description is given below, with reference to the drawings, of a speech segment detection device according to the second embodiment of the present invention. The speech segment detection device according to the present embodiment is different from the speech segment detection device according to the first embodiment in that the former determines a speech segment based only on the inter-frame correlation of spectral components in the case of a high SNR.

FIG. 7 is a block diagram showing a hardware structure of a speech segment detection device 30 according to the present embodiment. The same reference numbers are assigned to the same constituent elements as those of the speech segment detection device 20 in the first embodiment. Since their names and functions are also the same, the description thereof is omitted as appropriate. Note that the description thereof is also omitted as appropriate in the following embodiments.

The speech segment detection device 30 is a device which determines, in an input signal, a speech segment that is a segment during which a man utters a sound, and includes the FFT unit 200, the harmonic structure extraction unit 201, a voiced feature evaluation unit 210, an SNR estimation unit 206 and the speech segment determination unit 205.

The voiced feature evaluation unit 210 is a device which extracts a voiced segment, and includes the feature storage unit 202, the inter-frame feature correlation value calculation unit 203 and the difference processing unit 204.

The SNR estimation unit 206 estimates the SNR of an input signal based on the correlation value corrected using the average difference, which is outputted from the difference processing unit 204. When it is estimated that the SNR is low, the SNR estimation unit 206 outputs the corrected correlation value received from the difference processing unit 204 to the speech segment determination unit 205. When it is estimated that the SNR is high, the SNR estimation unit 206 does not output the corrected correlation value to the speech segment determination unit 205, but itself determines the speech segment based on the corrected correlation value. This is because the difference between a speech segment and a non-speech segment becomes clear when the SNR of the input signal is high.

Next, a description is given of a method for estimation of the SNR of an input signal by the SNR estimation unit 206. When the average value of correlation values calculated by the difference processing unit 204 is smaller than a threshold value, the SNR estimation unit 206 estimates that the SNR is high, and when the average value is equal to or larger than the threshold value, it estimates that the SNR is low. This is for the following reasons. When the average value of correlation values is calculated over a time period sufficiently longer than the duration of one utterance (for example, five seconds), the correlation values decrease in the noise segment under a high SNR environment, so the average value of these correlation values also decreases. On the other hand, under a low SNR environment having a periodic noise or the like, the correlation values increase in the noise segment, so the average value of these correlation values also increases. Using this linkage between the average value of correlation values and the SNR, it becomes possible to easily estimate the SNR just by evaluating one already-calculated parameter.

The operation of the speech segment detection device 30 structured as above is described below. FIG. 8 is a flowchart of the processing performed by the speech segment detection device 30.

The operations of the speech segment detection device 30 from the FFT processing by the FFT unit 200 (S2) through the corrected correlation value calculation processing by the difference processing unit 204 (S8) are the same as those of the speech segment detection device 20 of the first embodiment shown in FIG. 2. Therefore, the detailed description thereof is not repeated here.

Next, the SNR estimation unit 206 estimates the SNR of the input signal according to the above method (S12). When it is estimated that the SNR is high (YES in S14), the SNR estimation unit 206 determines that a segment in which the corrected correlation value is larger than a predetermined threshold value is a speech segment. When it is estimated that the SNR is low (NO in S14), the speech segment determination unit 205 performs the same speech segment determination processing as in the first embodiment (S10 in FIG. 2), which is described with reference to FIG. 2 and FIG. 6, and determines speech segments (S10).
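The branch in S12 to S14 can be sketched as follows. The function name, the two threshold parameters and the return shape are hypothetical; the sketch only shows that the same average value serves both the correction and the SNR estimate.

```python
def detect_speech_frames(correlations, snr_threshold, speech_threshold=0.0):
    """Return (snr_is_high, per-frame speech flags).

    The SNR is judged high when the average correlation value is
    smaller than snr_threshold (S12, S14)."""
    average = sum(correlations) / len(correlations)
    corrected = [c - average for c in correlations]
    snr_is_high = average < snr_threshold
    # Frame-wise judgment on the corrected correlation values.
    flags = [c > speech_threshold for c in corrected]
    return snr_is_high, flags
```

When `snr_is_high` is true, the flags could be used directly as the speech segment; otherwise they would still need the concatenation and duration processing of the first embodiment (S10).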

As described above, in addition to the advantages described in the first embodiment, the present embodiment brings about the advantage that, in the case of a high SNR, there is no need to perform the speech segment determination processing based on the continuity and duration of speech segments. Therefore, it becomes possible to detect speech segments in almost real time.

Third Embodiment

A description is given below, with reference to the drawings, of a speech segment detection device according to the third embodiment of the present invention. The speech segment detection device according to the present embodiment is capable not only of determining speech segments having harmonic structures but also, in particular, of distinguishing between music and human voices.

FIG. 9 is a block diagram showing a hardware structure of a speech segment detection device 40 according to the present embodiment. The speech segment detection device 40 is a device which determines, in an input signal, a speech segment that is a segment during which a man vocalizes and a music segment that is a segment of music. It includes the FFT unit 200, a harmonic structure extraction unit 401 and a speech/music segment determination unit 402.

The harmonic structure extraction unit 401 is a processing unit which outputs values indicating harmonic structure features, based on the power spectral components extracted by the FFT unit 200. The speech/music segment determination unit 402 is a processing unit which determines speech segments and music segments based on the values indicating the harmonic structures outputted from the harmonic structure extraction unit 401.

The operation of the speech segment detection device 40 structured as above is described below. FIG. 10 is a flowchart of the processing performed by the speech segment detection device 40.

The FFT unit 200 obtains, as acoustic features used for extraction of harmonic structures, power spectral components by performing FFT on an input signal (S2).

Next, the harmonic structure extraction unit 401 extracts the values indicating the harmonic structures from the power spectral components extracted by the FFT unit 200 (S82). The harmonic structure extraction processing (S82) is described later in detail.

The harmonic structure extraction unit 401 determines speech segmentsand music segments based on the values indicating the harmonicstructures (S84). The speech/music segment determination processing(S84) is described later in detail.

Next, a detailed description of the above-mentioned harmonic structure extraction processing (S82) is given below. In the harmonic structure extraction processing (S82), the value indicating the harmonic structure feature is obtained based on the correlation between frequency bands when the power spectral component is divided into a plurality of frequency bands. The value is obtained using this method for the following reason. When it is assumed that the harmonic structure is seen in the frequency bands which clearly show the effect of the speech signal generated by the vocal fold vibration that is the source of that harmonic structure, it can be estimated that there is a high correlation of power spectral components between adjacent frequency bands. In other words, as shown in FIG. 11, in the case where the power spectral component indicated on the vertical axis is separated into a plurality of frequency bands (eight frequency bands in this diagram) in each frame indicated on the horizontal axis, there is a high correlation between the frequency bands with harmonic structures (for example, between the band 608 and the band 606), while there is a low correlation between the frequency bands without harmonic structures (for example, between the band 602 and the band 604).

FIG. 12 is a flowchart showing the details of the harmonic structure extraction processing (S82). The harmonic structure extraction unit 401 calculates each inter-band correlation value C(i, k) in each frame, as mentioned above (S92). The inter-band correlation value C(i, k) is represented by the following equation (6).

$$C(i, k) = \max\bigl(\mathrm{Xcorr}\bigl(P(i,\ L\cdot(k-1)+1 : L\cdot k),\ P(i,\ L\cdot k+1 : L\cdot(k+1))\bigr)\bigr) \tag{6}$$

Here, P(i, x:y) represents the vector sequence consisting of the frequency components in the range x:y (larger than x and smaller than y) of the power spectrum in a frame i. L represents a bandwidth, and max(Xcorr(•)) represents the maximum value of the correlation coefficients between vector sequences.

Since there is a high correlation between adjacent frequency bands with harmonic structures, the inter-band correlation value C(i, k) takes a larger value. Conversely, since there is a low correlation between adjacent frequency bands without harmonic structures, the inter-band correlation value C(i, k) takes a smaller value.
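The inter-band correlation of two adjacent bands within one frame can be sketched as follows. The description does not define Xcorr precisely, so this sketch assumes it is the cross-correlation over all lags, normalized by the energies of the two bands. The function names are hypothetical, and the band indexing assumes that band k covers components L·(k−1)+1 to L·k (1-based), as written in equation (7).

```python
import math


def max_xcorr(a, b):
    """Maximum over all lags of the energy-normalized cross-correlation
    of two equal-length sequences (assumed meaning of max(Xcorr(.)))."""
    n = len(a)
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    values = []
    for lag in range(-(n - 1), n):
        total = sum(a[t] * b[t + lag] for t in range(n) if 0 <= t + lag < n)
        values.append(total / norm)
    return max(values)


def interband_correlation(P, k, L):
    """C(i, k) for one frame's power spectrum P (0-based list)."""
    lower = P[L * (k - 1):L * k]    # components L*(k-1)+1 .. L*k
    upper = P[L * k:L * (k + 1)]    # components L*k+1 .. L*(k+1)
    return max_xcorr(lower, upper)
```

Two bands with the same peak pattern give a value of 1, while a flat band compared with a single-peak band gives a smaller value.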

Note that the inter-band correlation value C(i, k) may be obtained by the following equation (7).

$$C(i, k) = \max\bigl(\mathrm{Xcorr}\bigl(P(i,\ L\cdot(k-1)+1 : L\cdot k),\ P(i+1,\ L\cdot k+1 : L\cdot(k+1))\bigr)\bigr) \tag{7}$$

Note that the equation (6) represents the correlation of power spectral components between adjacent frequency bands in the same frame, like the band 608 and the band 606 or the band 604 and the band 602, while the equation (7) represents the correlation of power spectral components between adjacent frequency bands in adjacent frames, like the band 608 and the band 610. By using the correlation between not only adjacent bands but also adjacent frames as shown by the equation (7), it becomes possible to calculate the correlation between bands and the correlation between frames at the same time.

Furthermore, the inter-band correlation value C(i, k) may be calculated by the following equation (8).

$$C(i, k) = \max\bigl(\mathrm{Xcorr}\bigl(P(i,\ L\cdot(k-1)+1 : L\cdot k),\ P(i,\ L\cdot(k-1)+1 : L\cdot(k+1))\bigr)\bigr) \tag{8}$$

The equation (8) represents the correlation of power spectra in the same frequency band between adjacent frames.

Next, [R(i), N(i)], that is, a pair of the harmonic structure value R(i) indicating the harmonic structure feature in the frame i and the frequency band number N(i), is obtained (S94). [R(i), N(i)] is represented by the following equation (9).

$$[R(i), N(i)] = [R_1(i) - R_2(i),\ N_1(i) - N_2(i)] \tag{9}$$

Here, R1(i) and R2(i) are represented as follows:

$$R_1(i) = \max_{k = 1 \ldots L-1} \bigl(C(i, k)\bigr) \tag{10}$$

$$R_2(i) = \min_{k = 1 \ldots L-1} \bigl(C(i, k)\bigr) \tag{11}$$

C: Frequency band harmonic scale in frequency band k of frame i

L: Number of frequency bands

N1(i) and N2(i) represent the numbers of the frequency bands in which C(i, k) has the maximum and minimum values respectively. The harmonic structure value represented by the equation (9) is obtained by subtracting the minimum value from the maximum value of the inter-band correlation value in the same frame. Therefore, the harmonic structure value is larger in a frame with a harmonic structure, while the value is smaller in a frame without a harmonic structure. Subtracting the minimum value from the maximum value also has the advantage that the inter-band correlation value is normalized. Therefore, it becomes possible to perform the normalization processing within one frame without performing the processing for obtaining the difference from the average correlation value like the processing of S8 in FIG. 2.
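Equations (9) to (11) can be sketched for one frame as follows. The function name is hypothetical, and `C` is assumed to hold the values C(i, k) for k = 1 to L−1, so that list index 0 corresponds to band number 1.

```python
def harmonic_structure_value(C):
    """Return [R(i), N(i)] of equation (9) from one frame's C(i, k)."""
    k_max = max(range(len(C)), key=lambda k: C[k])
    k_min = min(range(len(C)), key=lambda k: C[k])
    R = C[k_max] - C[k_min]            # R1(i) - R2(i)
    N = (k_max + 1) - (k_min + 1)      # N1(i) - N2(i), 1-based band numbers
    return R, N
```

For `C = [0.25, 1.0, 0.5, 0.125]`, the maximum lies in band 2 and the minimum in band 4, giving R = 0.875 and N = −2.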

Next, the harmonic structure extraction unit 401 calculates thecorrected band numbers Nd(i) which are obtained by assigning weights onthe band numbers N(i) according to the distributions thereof in the pastXc frames (S96). The harmonic structure extraction unit 401 obtains themaximum value Ne(i) of the corrected band numbers Nd(i) in the past Xcframes (S98). The maximum value Ne(i) is hereinafter referred to as aweighted band number.

The corrected band number Nd(i) and the weighted band number Ne(i) are obtained by the following equations in the case of Xc=5.

$$Nd(i) = \underset{k = i - Xc : i}{\mathrm{median}}\bigl(N(k)\bigr) - \underset{k = i - Xc : i}{\mathrm{var}}\bigl(N(k)\bigr) \tag{12}$$

$$Ne(i) = \max_{k = i : i + Xc}\bigl(Nd(k)\bigr) \tag{13}$$

Nd: Frequency band number corrected based on distribution

Ne: Maximum value of band numbers Nd of past Xc frames corrected based on distribution

Xc: Frame width for distribution calculation

In a segment without a harmonic structure, the band numbers N(i) are distributed widely. Therefore, the values of the corrected band numbers Nd(i) become smaller (for example, negative values), and the value of the weighted band number Ne(i) becomes smaller accordingly.

Furthermore, the harmonic structure extraction unit 401 corrects the harmonic structure value R(i) with the weighted band number Ne(i) so as to calculate the corrected harmonic structure value R′(i) (S100). The corrected harmonic structure value R′(i) is obtained by the following equation (14). Note that the value calculated in S8 may be used here as the harmonic structure value R(i).

$$R'(i) = R(i) \cdot Ne(i) \tag{14}$$
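The calculations of S96 to S100 can be sketched as follows. Several points are assumptions: the variance is taken as the population variance, windows are truncated at the start of the signal, and the maximum for Ne(i) is taken over the past Xc frames as stated in the prose (the index range written in equation (13) differs). The function names are hypothetical.

```python
from statistics import median, pvariance


def corrected_band_number(N, i, Xc=5):
    """Nd(i) of equation (12): median minus variance of N over frames i-Xc..i."""
    window = N[max(0, i - Xc):i + 1]
    return median(window) - pvariance(window)


def weighted_band_number(N, i, Xc=5):
    """Ne(i): maximum of Nd(k) over the past frames i-Xc..i (assumed window)."""
    return max(corrected_band_number(N, k, Xc)
               for k in range(max(0, i - Xc), i + 1))


def corrected_harmonic_value(R, N, i, Xc=5):
    """R'(i) of equation (14)."""
    return R[i] * weighted_band_number(N, i, Xc)
```

When the band numbers alternate between 1 and 5, the variance (4) exceeds the median (3) and Nd(i) becomes negative, which is the behaviour described for segments without a harmonic structure.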

FIG. 13 to FIG. 15 are diagrams showing the experimental results of the above-mentioned harmonic structure extraction processing (S82).

FIG. 13 is a diagram showing an experimental result in the case where a man utters a sound in an environment with vacuum cleaner noise (SNR=10 dB). It is assumed that a sudden “click” sound which is made when the vacuum cleaner is turned on appears around the 40th frame, and that the sound level of the vacuum cleaner increases and a periodic noise appears around the 280th frame when the rotation speed of the motor is changed from low to high. It is also assumed that the man utters sounds during the period from the 80th frame to the 280th frame.

FIG. 13(a) shows power spectra of an input signal, FIG. 13(b) shows harmonic structure values R(i), FIG. 13(c) shows band numbers N(i), FIG. 13(d) shows weighted band numbers Ne(i), and FIG. 13(e) shows corrected harmonic structure values R′(i). Note that the band numbers shown in FIG. 13(c) indicate lower frequencies as they come close to 0, because they are obtained by multiplying the actual band numbers by −1 for ease of viewing.

As shown in FIG. 13(c), in the parts in which a sudden sound and a periodic noise appear (the parts enclosed by broken lines in this diagram), the band numbers N(i) fluctuate largely. Therefore, as shown in FIG. 13(d), the weighted band numbers Ne(i) corresponding to those parts have smaller values, and the corrected harmonic structure values decrease accordingly, as shown in FIG. 13(e).

FIG. 14 is a diagram showing an experimental result in the case where the same sound is produced as that in FIG. 13 in an environment in which the noise of a vacuum cleaner hardly appears. Also in this environment, the corrected harmonic structure values R′(i) in the parts without harmonic structures are smaller (FIG. 14(e)), as is the case with FIG. 13.

FIG. 15 is a diagram showing an experimental result for music without vocals. Music has harmonic structures because harmonies are outputted, but it does not have a harmonic structure in a segment during which a drum is beaten or the like. FIG. 15(a) shows power spectra of an input signal, FIG. 15(b) shows harmonic structure values R(i), FIG. 15(c) shows band numbers N(i), FIG. 15(d) shows weighted band numbers Ne(i), and FIG. 15(e) shows corrected harmonic structure values. Note that the band numbers shown in FIG. 15(c) indicate lower frequencies as the values thereof come close to 0, for the same reason as in FIG. 13(c). In the sections enclosed with broken lines, harmonic structures are lost due to the beating of the drum. As a result, the weighted band numbers Ne(i) decrease in those sections, as shown in FIG. 15(d). Therefore, as shown in FIG. 15(e), the corrected harmonic structure values R′(i) also decrease. The corrected harmonic structure values R′(i) decrease in the unvoiced segment, too.

Note that in the processing of S94, it is also possible to obtain a pair [R(i), N(i)] of a harmonic structure value R(i) and a band number N(i) indicating a harmonic structure in a frame i according to the following equation (15).

$$[R(i), N(i)] = [R_1(i) - R_2(i),\ N_1(i) - N_2(i)] \tag{15}$$

Here, R1(i) and R2(i) are represented as follows:

$$R_1(i) = \sum_{k = 1 \ldots NSP} C(i, k) \tag{16}$$

$$R_2(i) = \sum_{k = L - NSP \ldots L-1} C(i, k) \tag{17}$$

C: Frequency band harmonic scale in band k of frame i

L: Number of bands

NSP: Number of bands which are assumed to be speech pitch frequency bands

N1(i) and N2(i) represent the maximum and minimum numbers of bands at which C(i, k) has the maximum value and the minimum value respectively.

Note that R1(i) or R2(i) may be a harmonic structure value R(i).
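The alternative harmonic structure value of equations (15) to (17) can be sketched as follows, assuming that `C` holds C(i, k) for k = 1 to L−1 so that the sum in equation (17) runs over the last NSP entries. The function name is hypothetical.

```python
def harmonic_value_by_band_sums(C, NSP):
    """R(i) = R1(i) - R2(i) using the sums of equations (16) and (17)."""
    R1 = sum(C[:NSP])     # k = 1 .. NSP (assumed speech pitch bands)
    R2 = sum(C[-NSP:])    # k = L-NSP .. L-1 (highest bands)
    return R1 - R2
```

For `C = [4, 3, 1, 1, 2]` and NSP = 2, R1 is 7 and R2 is 3, so the result is 4.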

FIG. 16 shows an experimental result in which weighted harmonic structure values R′(i) are obtained according to the equation (15). FIG. 16 is a diagram showing an experimental result in the case where a man utters a sound in an environment in which there is quite considerable noise of a vacuum cleaner (SNR=0 dB). Note that the timing at which the man utters the sound and the timings at which the sudden sound and periodic noise of the vacuum cleaner appear are the same as those shown in FIG. 13. The values shown here are obtained by the equation (15) in the case of L=16 and NSP=2.

In this case, the weighted harmonic structure values R′(i) are larger in the frames in which the man utters the sounds, while they are smaller in the frames in which the sudden sound and periodic noise appear.

Next, a detailed description is given below of the speech/music segment determination processing (S84 in FIG. 10). FIG. 17 is a detailed flowchart of the speech/music segment determination processing (S84 in FIG. 10).

The speech/music segment determination unit 402 checks whether or not a power spectrum P(i) in a frame i is larger than a predetermined threshold value Pmin (S112). When the power spectrum P(i) is equal to or smaller than the predetermined threshold value Pmin (NO in S112), it judges that the frame i is a silent frame (S126). When the power spectrum P(i) is larger than the predetermined threshold value Pmin (YES in S112), it judges whether or not the corrected harmonic structure value R′(i) is larger than a predetermined threshold value Rmin (S114).

When the corrected harmonic structure value R′(i) is equal to or smaller than the predetermined threshold value Rmin (NO in S114), the speech/music segment determination unit 402 judges that the frame i is a frame of a sound without a harmonic structure (S124). When the corrected harmonic structure value R′(i) is larger than the predetermined threshold value Rmin (YES in S114), the speech/music segment determination unit 402 calculates the average value per unit time ave_Ne(i) of the weighted band numbers Ne(i) (S116), and checks whether or not the average value per unit time ave_Ne(i) is larger than a predetermined threshold value Ne_min (S118). Here, ave_Ne(i) is obtained according to the following equation (18). It represents the average value of Ne over d frames (50 frames here) including the frame i.

$$ave\_Ne(i) = \underset{k = i - d : i}{\mathrm{average}}\bigl(Ne(k)\bigr) \tag{18}$$

d: Number of frames for which average value per unit time is obtained

When ave_Ne(i) is larger than the predetermined threshold value Ne_min (YES in S118), the frame is judged to be music (S120), and in other cases (NO in S118), it is judged to be a sound with harmonic structures such as a human voice (S122). The above-mentioned processing (S112 to S126) is repeated for all the frames (S110 to S128).
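The decision flow of S112 to S126 for one frame can be sketched as follows. The function names, the label strings and the treatment of `power` as a single scalar per frame are assumptions. The averaging window follows the prose (d frames including the frame i) and is truncated at the start of the signal.

```python
def average_ne(Ne, i, d=50):
    """ave_Ne(i): mean of Ne over the d most recent frames including frame i."""
    window = Ne[max(0, i - d + 1):i + 1]
    return sum(window) / len(window)


def classify_frame(power, r_corrected, ave_ne, p_min, r_min, ne_min):
    """Classify one frame following S112, S114 and S118."""
    if power <= p_min:           # NO in S112
        return "silence"
    if r_corrected <= r_min:     # NO in S114
        return "non-harmonic"
    if ave_ne > ne_min:          # YES in S118
        return "music"
    return "speech"
```

The thresholds Pmin, Rmin and Ne_min are passed in as parameters because the description does not give their values.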

Note that music and speech are separated among the sounds with harmonic structures based on the sizes of the values ave_Ne(i) for the following reason. Both music and speech signals are sounds with harmonic structures. However, in speech, voiced sound and unvoiced sound appear repeatedly, so the harmonic structure values are larger in the voiced sound part and smaller in the unvoiced sound part, and these two parts alternate in short segments. On the other hand, in music, harmonies are outputted continuously, so the part with a harmonic structure continues for a relatively long time and thus the larger harmonic structure values are maintained. This shows that the harmonic structure values do not fluctuate much in music, while they fluctuate considerably in speech. In other words, the average value per unit time of the weighted band numbers Ne(i) is larger in music than in speech.

Note that it is also possible to distinguish between speech and music by focusing attention on the temporal continuity of harmonic structure values. In other words, it is possible to check how many frames per unit time have the smaller harmonic structure values. For that purpose, the number of frames per unit time in which the weighted band number Ne(i) is a negative value may be counted, for example. In the case where the number of frames in which the weighted band number Ne(i) is negative per unit time (among the past 50 frames including the current frame i, for example) is Ne_count(i), it is possible to calculate Ne_count(i) instead of ave_Ne(i) in S116, and to determine in S118 that the segment is speech when the number of frames Ne_count(i) is larger than a predetermined threshold value and that the segment is music when the number of frames is equal to or smaller than the predetermined threshold value.
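As a rough illustration of this alternative, the count Ne_count(i) could be computed as below. The function names and the `count_threshold` parameter are hypothetical, and the window is assumed to be the d most recent frames ending at the frame i.

```python
def ne_count(ne, i, d=50):
    """Number of frames with a negative weighted band number Ne in the window."""
    # Assumption: window = the d most recent frames including frame i.
    window = ne[max(0, i - d + 1): i + 1]
    return sum(1 for value in window if value < 0)


def is_speech_by_count(ne, i, count_threshold, d=50):
    """Speech if many frames in the window lack a harmonic structure."""
    return ne_count(ne, i, d) > count_threshold
```

With this variant the sense of the comparison in S118 is reversed relative to ave_Ne(i): a large count indicates speech, a small count indicates music.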

As described above, in the present embodiment, a power spectral component in each frame is divided into a plurality of frequency bands and correlations between bands are obtained. Therefore, it becomes possible to extract the frequency band in which the effect of a signal of speech generated by vocal fold vibration is properly reflected, and thus to extract a harmonic structure without fail.

Furthermore, it becomes possible to judge whether a sound with a harmonic structure is music or speech, based on the fluctuation or continuity of harmonic structures.

Fourth Embodiment

Next, a description is given, with reference to the drawings, of a speech segment detection device according to the fourth embodiment of the present invention. The speech segment detection device in the present embodiment determines speech segments with harmonic structures based on the distribution of harmonic structure values.

FIG. 18 is a block diagram showing a hardware structure of a speech segment detection device 50 according to the fourth embodiment. The speech segment detection device 50 is a device which detects speech segments with harmonic structures in an input signal, and includes the FFT unit 200, a harmonic structure extraction unit 501, the SNR estimation unit 206 and a speech segment determination unit 502.

The harmonic structure extraction unit 501 is a processing unit which outputs the values indicating harmonic structures based on the power spectral components outputted from the FFT unit 200. The speech segment determination unit 502 is a processing unit which determines speech segments based on the values indicating harmonic structures and the estimated SNR values.

The operation of the speech segment detection device 50 structured as above is described below. FIG. 19 is a flowchart of the processing performed by the speech segment detection device 50. The FFT unit 200 obtains the power spectral components as acoustic features to be used for extraction of harmonic structures by performing FFT on the input signal (S2).

Next, the harmonic structure extraction unit 501 extracts the values indicating harmonic structures from the power spectral components extracted by the FFT unit 200 (S140). The harmonic structure extraction processing (S140) is described later.

The SNR estimation unit 206 estimates the SNR of the input signal based on the values indicating the harmonic structures (S12). The method for estimating the SNR is the same as the method in the second embodiment. Therefore, a detailed description thereof is not repeated here.

The speech segment determination unit 502 determines speech segments based on the values indicating harmonic structures and the estimated SNR values (S142). The speech segment determination processing (S142) is described later in detail.

In the present embodiment, the accuracy of determining speech segments is improved by adding the evaluation of the transition segments between a voiced sound and an unvoiced sound. According to the speech segment determination method shown in FIG. 6, (1) speech segments are concatenated when the distance between them is shorter than a predetermined number of frames (S52), and (2) the concatenated speech segment is judged to be a non-speech segment when the duration of that segment is shorter than a predetermined time period (S60). In other words, this is the method in which it is implicitly expected that, by the processing (2), an unvoiced segment is concatenated with a speech segment which is judged to be a voiced segment in the processing (1), without evaluation of the frame between the unvoiced segment and the voiced segment.

When speech segments are seen in detail, it is deemed that speech segments can be categorized into the following three groups (Group A, Group B and Group C) according to the transition types between voiced sound, unvoiced sound and noise (non-speech segment).

Group A is a voiced sound group, and can include the following transition types: from a voiced sound to a voiced sound; from a noise to a voiced sound; and from a voiced sound to a noise.

Group B is a group of a mixture of a voiced sound and an unvoiced sound, and can include the following transition types: from a voiced sound to an unvoiced sound; and from an unvoiced sound to a voiced sound.

Group C is a non-speech group, and can include the following transition types: from an unvoiced sound to an unvoiced sound; from an unvoiced sound to a noise; from a noise to an unvoiced sound; and from a noise to a noise.

As for the sound included in Group A, only the voiced segments are determined depending on the accuracy of the values indicating their harmonic structures. On the other hand, as for the sound included in Group B, it can be expected that an unvoiced segment can also be extracted if the transition of sound around a voiced segment can be evaluated. As for the sound included in Group C, it seems to be very difficult to extract only an unvoiced sound under a noise environment. This is because the noise features cannot be defined easily or the SNR for unvoiced noise is often low.

Therefore, in the present embodiment, the sound of Group B is extracted by evaluating the transition between a voiced sound and an unvoiced sound, in addition to the method of FIG. 6 in which speech segments are determined by extracting only the sound of Group A. As a result, we believe that the accuracy of determining speech segments can be improved. Furthermore, it can be assumed that the values indicating harmonic structures change significantly in the transition segments from an unvoiced sound to a voiced sound and from a voiced sound to an unvoiced sound. Therefore, it becomes possible to recognize this change in the values indicating harmonic structures by using a measure of the distribution of those values in the surroundings of the segment which is judged to be a voiced segment using these values. Here, this measure of the distribution of the values indicating harmonic structures is called a weighted distribution Ve.

Next, a detailed description of the harmonic structure extraction processing (S140 in FIG. 19) is given below. FIG. 20 is a flowchart showing the details of the harmonic structure extraction processing (S140).

The harmonic structure extraction unit 501 calculates an inter-band correlation value C(i, k) for each frame (S150). The inter-band correlation value C(i, k) is calculated in the same manner as S92 in FIG. 12. Therefore, a detailed description thereof is not repeated here.

Next, the harmonic structure extraction unit 501 calculates a weighted distribution Ve(i) using the inter-band correlation value C(i, k), according to the following equation (S152). $\begin{matrix}{Ve(i) = \underset{k = 1:L}{\mathrm{count}}\left( \underset{j = i - Xc:i}{\mathrm{var}}\left( C(j,k) \right) > \mathrm{th\_var\_change} \right)} & (19)\end{matrix}$

where Xc: Frame width (=16)

L: Number of frequency bands (=16)

th_var_change: Threshold value

It is assumed that a function var( ) is a function representing the distribution (variance) of the values in the parentheses, and a function count( ) is a function for counting the number of conditions that are satisfied among the conditions in the parentheses.
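Equation (19) can be illustrated with the following Python sketch. It assumes that var( ) is the population variance (the specification does not say whether the population or the sample variance is meant), that C is given as a list of frames each holding L band correlation values, and that the window is shortened at the beginning of the signal. The default value of `th_var_change` is a placeholder, not a value from the specification.

```python
from statistics import pvariance


def weighted_distribution(C, i, xc=16, th_var_change=0.1):
    """Ve(i): number of bands whose correlation varies strongly over time.

    C[j][k] is the inter-band correlation value of band k in frame j.
    """
    start = max(0, i - xc)          # window j = i - Xc .. i
    count = 0
    for k in range(len(C[i])):      # bands k = 1 .. L
        history = [C[j][k] for j in range(start, i + 1)]
        if pvariance(history) > th_var_change:
            count += 1
    return count
```

A band whose correlation value stays constant over the window contributes nothing, while a band whose value swings between low and high (as expected around a voiced/unvoiced transition) increments Ve(i).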

Finally, the harmonic structure extraction unit 501 calculates the harmonic structure value R(i) (S154). This calculation method is the same as S94 in FIG. 12. Therefore, a detailed description thereof is not repeated here.

Next, a description of the speech segment determination processing (S142 in FIG. 19) is given with reference to FIG. 21. The speech segment determination unit 502 judges whether or not R(i) of a frame i is larger than a threshold value Th_R and whether or not Ve(i) is larger than a threshold value Th_ve (S182). When the above-mentioned conditions are both satisfied (YES in S182), the speech segment determination unit 502 judges that the frame i is a speech frame, and when the conditions are not satisfied, it judges that the frame i is a non-speech frame (S186). The speech segment determination unit 502 performs the above-mentioned processing for all the frames (S180 to S188). Next, the speech segment determination unit 502 judges whether the SNR estimated by the SNR estimation unit 206 is low or not (S190), and when the estimated SNR is low, it performs the processing of Loop B and Loop C (S52 to S68). The processing of Loop B and Loop C is the same as that shown in FIG. 6. Therefore, a detailed description thereof is not repeated here.

Note that when the estimated SNR is high (NO in S190), it omits Loop B and performs only the processing of Loop C (S60 to S68).
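The overall flow of FIG. 21 might be sketched as follows. Loop B and Loop C are modeled here only from the summary given in this embodiment (concatenating segments separated by fewer than a given number of frames, then discarding segments that are too short); the details of FIG. 6 are not reproduced, and the parameter names `max_gap` and `min_len` as well as the boolean `snr_is_low` are assumptions.

```python
def frames_to_segments(flags):
    """Convert per-frame booleans into a list of (start, end) segments."""
    segments, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(flags) - 1))
    return segments


def determine_speech_segments(R, Ve, th_r, th_ve, snr_is_low, max_gap, min_len):
    # S180 to S188: a frame is a speech frame only if both values exceed
    # their thresholds.
    flags = [r > th_r and v > th_ve for r, v in zip(R, Ve)]
    segments = frames_to_segments(flags)

    if snr_is_low:  # S190: Loop B is performed only when the SNR is low
        merged = []
        for seg in segments:
            if merged and seg[0] - merged[-1][1] - 1 < max_gap:
                merged[-1] = (merged[-1][0], seg[1])   # concatenate
            else:
                merged.append(seg)
        segments = merged

    # Loop C: segments shorter than min_len frames are treated as non-speech.
    return [s for s in segments if s[1] - s[0] + 1 >= min_len]
```

Skipping the concatenation when the SNR is high means that the segment boundaries come directly from the frame-wise decision.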

FIG. 22 and FIG. 23 are diagrams showing the results of the processing executed by the speech segment detection device 50. FIG. 22 is a diagram showing an experimental result in the case where a man utters a sound under the environment in which there is a noise of a vacuum cleaner (SNR=10 dB). It is assumed that a sudden sound “click” which is made when the vacuum is turned on appears around the 40th frame, and the sound level of the vacuum increases around the 280th frame when the rotation speed of the motor is changed from low to high and thus a periodic noise appears there. It is assumed that the man utters the sound during the segment between around the 80th frame and around the 280th frame.

FIG. 22(a) shows power spectra of an input signal, FIG. 22(b) shows harmonic structure values R(i), FIG. 22(c) shows weighted distributions Ve(i), FIG. 22(d) shows speech segments before being concatenated, and FIG. 22(e) shows speech segments after being concatenated.

In FIG. 22(d), solid lines indicate speech segments obtained by performing the threshold value processing (Loop A (S42 to S50) in FIG. 6) on the harmonic structure values R(i), and broken lines indicate speech segments obtained by performing the threshold value processing (Loop A (S180 to S188) in FIG. 21) on the harmonic structure values R(i) and the weighted distributions Ve(i). In FIG. 22(e), a broken line indicates a processing result obtained after concatenating the speech segments indicated by the broken lines in FIG. 22(d) according to the segment concatenation processing (S190 to S68 in FIG. 21), and solid lines indicate a processing result obtained after concatenating the speech segments indicated by the solid lines in FIG. 22(d) according to the segment concatenation processing (S52 to S68 in FIG. 6). As shown in FIG. 22(e), it becomes possible to extract the speech segment accurately using the weighted distributions Ve(i).

FIG. 23 is a diagram showing an experimental result in the case where a man utters the same sound as that shown in FIG. 22 under the environment in which the vacuum noise hardly appears (SNR=40 dB). The graphs in FIG. 23(a) to FIG. 23(e) have the same meanings as the graphs in FIG. 22(a) to FIG. 22(e). When comparing FIG. 23(d) showing the speech segments before being concatenated with FIG. 23(e) showing the speech segments after being concatenated, the result of S180 indicated by broken lines in FIG. 23(d) shows that the speech segments are already accurately concatenated in the same manner as indicated by solid lines in FIG. 23(e). Therefore, when the estimated SNR is very high, it is possible to maintain the high performance for detecting speech segments according to the judgment processing of S190 in FIG. 21, even if the speech segments are determined without performing the processing of S52 to S58.

As described above, according to the present embodiment, it becomes possible to extract the sounds belonging to the above Group B by evaluating transition segments between voiced sounds and unvoiced sounds using the weighted distributions Ve. As a result, it becomes possible to extract speech segments accurately without concatenating the segments, in the case where it is judged using an estimated SNR that the SNR is high. In addition, it becomes possible to reduce mis-detections of a noise segment as a speech segment because the predetermined number of frames to be concatenated (S54 in FIG. 21) can be decreased even if the SNR is low and the segments need to be concatenated.

Note that it is also possible to calculate corrected harmonic structure values R′(i) instead of harmonic structure values R(i) so as to detect a speech segment based on the weighted distributions Ve(i) and the corrected harmonic structure values R′(i). FIG. 24 is a flowchart showing another example of the harmonic structure extraction processing (S140 in FIG. 19).

The harmonic structure extraction unit 501 calculates an inter-band correlation value C(i, k), a weighted distribution Ve(i) and a harmonic structure value R(i) (S160 to S164). The methods for calculating these are the same as those shown in FIG. 20, so a detailed description thereof is not repeated here. Next, the harmonic structure extraction unit 501 calculates the weighted harmonic structure value Re(i) (S166). The weighted harmonic structure value Re(i) is calculated according to the following equations. These equations are different from the equations used for the calculation in S96/S98 in that the harmonic structure value R(i) of the frame i calculated in S94 is used in the former equations, while the band number N(i) thereof is used in the latter equations. Both sets of equations apply a correction by the weighted distribution so that the results serve as indices for accentuating the harmonic structure. $\begin{matrix}{Rd(i) = \underset{k = i - Xc:i}{\mathrm{median}}\left( R(k) \right) - \underset{k = i - Xc:i}{\mathrm{var}}\left( R(k) \right);} & (20) \\ {Re(i) = \max\limits_{k = i:i + Xc}\left( Rd(k) \right);} & (21)\end{matrix}$

Xc: Frame width for calculation of distribution (=5)

where the function median( ) indicates the median value of the values in the parentheses.

The harmonic structure extraction unit 501 calculates the corrected harmonic structure value R′(i) (S168). The corrected harmonic structure value R′(i) is calculated according to the following equations. $\begin{matrix}{R'(i) = Re(i), \quad \text{if } Re(i) > 0;} & (22) \\ {R'(i) = 0, \quad \text{if } Re(i) < 0;} & (23)\end{matrix}$
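Equations (20) to (23) can be illustrated by the Python sketch below. The population variance is assumed for var( ), the windows are shortened at both ends of the signal, and the case Re(i)=0, which equations (22) and (23) leave open, is mapped to 0; all three choices are assumptions made for this illustration.

```python
from statistics import median, pvariance


def corrected_harmonic_values(R, xc=5):
    """Return R'(i) for every frame from the harmonic structure values R(i)."""
    n = len(R)

    # Equation (20): median minus variance over the past window i - Xc .. i.
    rd = []
    for i in range(n):
        window = R[max(0, i - xc): i + 1]
        rd.append(median(window) - pvariance(window))

    # Equation (21): maximum of Rd over the following window i .. i + Xc.
    re = [max(rd[i: min(n, i + xc + 1)]) for i in range(n)]

    # Equations (22) and (23): negative values are clipped to zero.
    return [value if value > 0 else 0.0 for value in re]
```

Note that equation (21) looks ahead by Xc frames, so an implementation working on a live stream would only be able to output R′(i) with a delay of Xc frames.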

FIG. 25 and FIG. 26 are diagrams showing the result of the processing executed according to the flowchart shown in FIG. 24. FIG. 25 shows an experimental result in the case where a man utters a sound under the environment in which there is no noise of a vacuum cleaner (SNR=40 dB), while FIG. 26 shows an experimental result in the case where the man utters the sound under the environment in which there appears the vacuum noise (SNR=10 dB). It is assumed that in this experiment, the man utters the same sound as that shown in FIG. 23 and the sudden sound and periodic noise also appear at the same timings as those in FIG. 23.

FIG. 25(a) shows an input signal, FIG. 25(b) shows power spectra of the input signal, FIG. 25(c) shows harmonic structure values R(i), FIG. 25(d) shows weighted harmonic structure values Re(i), and FIG. 25(e) shows corrected harmonic structure values R′(i). FIG. 26(a) to FIG. 26(e) show graphs similar to those shown in FIG. 25(a) to FIG. 25(e).

The corrected harmonic structure values R′(i) are calculated based on the distribution of the harmonic structure values R(i) themselves. Therefore, it becomes possible to properly extract a part with a harmonic structure using the property that there appears a wider distribution in the part with a harmonic structure while there appears a narrower distribution in the part without a harmonic structure.

Fifth Embodiment

Each of the speech segment detection devices according to the above-mentioned first through fourth embodiments determines a speech segment in an input signal of speech which is previously recorded in a file or the like. This type of processing method is effective when, for example, the processing is performed on already recorded data, but unsuitable for determining a segment during reception of speech. Therefore, in the present embodiment, a description is given of a speech segment detection device which determines a speech segment in synchronism with reception of speech.

FIG. 27 is a block diagram showing a structure of a speech segment detection device 60 according to the present embodiment of the present invention. The speech segment detection device 60 is a device which detects a speech segment with a harmonic structure (harmonic structure segment) in an input signal, and includes the FFT unit 200, a harmonic structure extraction unit 601, a harmonic structure segment final determination unit 602 and a control unit 603.

FIG. 28 is a flowchart of processing performed by the speech segment detection device 60. The control unit 603 sets FR, FRS, FRE, RH, RM, CH, CM and CN to be 0 (S200). Here, FR indicates the number of the first frame among the frames in which the harmonic structure values R(i) to be described later are not yet calculated. FRS indicates the number of the first frame in the segment which is not yet determined to be a harmonic structure segment or not. FRE indicates the number of the last frame on which the harmonic structure frame provisional judgment processing to be described later is performed. RH and RM indicate the accumulated values of the harmonic structure values. CH, CM and CN are counters.

The FFT unit 200 performs FFT on an input frame. The harmonic structure extraction unit 601 extracts a harmonic structure value R(i) based on the power spectral components extracted by the FFT unit 200. The above processing is performed on all the frames from the starting frame FR through the frame FRN of the current time (Loop A in S202 to S210). Every time the loop is executed once, the counter i is incremented by one and the value of the counter i is substituted into the starting frame FR (S210).

Next, the harmonic structure segment final determination unit 602 performs the harmonic structure frame provisional judgment processing for provisionally judging a segment with a harmonic structure, based on the harmonic structure value R(i) obtained in the previous processing (S212). The harmonic structure frame provisional judgment processing is described later.

After the processing in S212, the harmonic structure segment final determination unit 602 checks whether adjacent harmonic structure segments are found or not, namely, whether or not the non-harmonic structure segment length CN is longer than 0 (S214). As shown in FIG. 29(a), the non-harmonic structure segment length CN indicates the number of frames between the last frame of a harmonic structure segment and the starting frame of the next harmonic structure segment.

In the case where the adjacent harmonic structure segments are found, the harmonic structure segment final determination unit 602 checks whether or not the non-harmonic structure segment length CN is smaller than a predetermined threshold TH (S216). When the non-harmonic structure segment length CN is smaller than the predetermined threshold TH (YES in S216), the harmonic structure segment final determination unit 602 concatenates the harmonic structure segments as shown in FIG. 29(b), and provisionally judges the frames from the frame FRS2 through the frame (FRS2+CN) to be harmonic structure segments (S218). Here, FRS2 indicates the number of the first frame of the frames which are provisionally judged to be harmonic structure segments.

In the case where the non-harmonic structure segment length CN is equal to or larger than the predetermined threshold TH (NO in S216), the harmonic structure segments are not concatenated as shown in FIG. 29(c), and the harmonic structure segment final determination unit 602 performs the harmonic structure segment final determination processing to be described later on those segments (S220). After that, the control unit 603 substitutes FRE into FRS, and also substitutes 0 into RH, RM, CH and CM (S222). The harmonic structure segment final determination processing (S220) is described later.

In the case where the adjacent harmonic structure segments are not found (NO in S214 and FIG. 29(d)), or after the processing of S218 or S222, the control unit 603 judges whether the input of the audio signal has been completed or not (S224). If the input of the audio signal has not yet been completed (NO in S224), the processing of S202 and the following is repeated. If the input of the audio signal has been completed (YES in S224), the harmonic structure segment final determination unit 602 performs the harmonic structure segment final determination processing (S226) and ends the processing. The harmonic structure segment final determination processing (S226) is described later.

Next, a description is given of the harmonic structure frame provisional judgment processing (S212 in FIG. 28). FIG. 30 is a detailed flowchart of the harmonic structure frame provisional judgment processing. The harmonic structure segment final determination unit 602 judges whether or not the harmonic structure value R(i) is larger than a predetermined harmonic structure threshold 1 (S232), and in the case where the value R(i) is larger (YES in S232), it provisionally judges that the current frame i is a frame with a harmonic structure. Then, it adds the harmonic structure value R(i) to the accumulated harmonic structure value RH, and increments the counter CH by one (S234).

Next, the harmonic structure segment final determination unit 602 judges whether or not the harmonic structure value R(i) is larger than the harmonic structure threshold 2 (S236), and in the case where the value R(i) is larger (YES in S236), it provisionally judges that the current frame i is a music frame with a harmonic structure. Then, it adds the harmonic structure value R(i) to the accumulated musical harmonic structure value RM, and increments the counter CM by one (S236). The above processing is repeated for the frame FRE through the frame FRN (S230 to S238).

Next, after setting the frame FRS2 to the frame FRS, the harmonic structure segment final determination unit 602 judges whether or not the harmonic structure value R(i) of the current frame i is larger than the harmonic structure threshold 1 (S242), and in the case where the value R(i) is larger, it sets the frame FRS2 to the frame i (S244). The above processing is repeated for the frame FRS through the frame FRN (S240 to S246).

Next, after setting the counter CN to be 0, the harmonic structure segment final determination unit 602 judges whether or not the harmonic structure value R(i) of the current frame i is equal to or smaller than the harmonic structure threshold 1 (S250), and in the case where the value R(i) is equal to or smaller than the harmonic structure threshold 1 (YES in S250), it provisionally judges that the frame i is a non-harmonic structure segment and increments the counter CN by one (S252). The above processing is repeated for the frame FRS2 through the frame FRN (S248 to S254). According to the above processing, segments with harmonic structures, segments with musical harmonic structures and non-harmonic structure segments are provisionally judged.
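The accumulation in S230 to S238 and the gap count in S248 to S254 might look like the following in Python. This is a simplified sketch: the two threshold tests are treated as independent of each other (FIG. 30 may nest them differently), the running totals start from zero instead of continuing from a previous call, and the function and parameter names are assumptions.

```python
def provisional_judgment(R, fre, frn, th1, th2):
    """Accumulate harmonic (RH, CH) and musical (RM, CM) evidence over
    the frames FRE .. FRN."""
    rh = rm = 0.0
    ch = cm = 0
    for i in range(fre, frn + 1):
        if R[i] > th1:        # S232: frame with a harmonic structure
            rh += R[i]
            ch += 1
        if R[i] > th2:        # S236: music frame with a harmonic structure
            rm += R[i]
            cm += 1
    return rh, ch, rm, cm


def count_non_harmonic(R, frs2, frn, th1):
    """CN: number of frames in FRS2 .. FRN at or below threshold 1."""
    return sum(1 for i in range(frs2, frn + 1) if R[i] <= th1)
```

Because only running sums and counters are kept, the judgment can be updated as each new block of frames arrives without revisiting earlier frames.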

Next, a detailed description of the harmonic structure segment final determination processing (S220 and S226 in FIG. 28) is given. FIG. 31 is a detailed flowchart of the harmonic structure segment final determination processing (S220 and S226 in FIG. 28).

The harmonic structure segment final determination unit 602 judges whether or not the value of the counter CH indicating the number of frames with harmonic structures is larger than the harmonic structure frame length threshold 1, and whether or not the accumulated harmonic structure value RH is larger than (FRE−FRS)×harmonic structure threshold 3 (S260). In the case where the above conditions are satisfied (YES in S260), the harmonic structure segment final determination unit 602 judges that the frame FRS through the frame FRE are harmonic structure frames (S262).

The harmonic structure segment final determination unit 602 judges whether or not the value of the counter CM indicating the number of frames with musical harmonic structures is larger than the harmonic structure frame length threshold 2, and whether or not the accumulated musical harmonic structure value RM is larger than (FRE−FRS)×harmonic structure threshold 4 (S264). In the case where the above conditions are satisfied (YES in S264), the harmonic structure segment final determination unit 602 judges that the frame FRS through the frame FRE are musical harmonic structure frames (S266).

In the case where the above conditions are not satisfied (NO in S260) or in the case of NO in S264, it can be judged that the frame is a frame without a musical harmonic structure but with a harmonic structure. Therefore, the harmonic structure segment final determination unit 602 judges that the frame FRS through the frame FRE are non-harmonic structure frames, and substitutes 0 into the counter CH and CN+FRE−FRS into the counter CN (S268).

The harmonic structure judgment method can be selected flexibly. For example, the result of the harmonic structure provisional judgment may be used when a frame-wise judgment is needed, the result of the harmonic structure segment determination may be used when a more accurate judgment is needed, and the two methods may be switched according to the situation.

By performing the above-mentioned processing, it becomes possible to determine harmonic structure frames, musical harmonic structure frames and non-harmonic structure frames.

As described above, according to the present embodiment, it is possible to judge in real time whether or not an input audio signal has a harmonic structure. Therefore, it becomes possible to eliminate non-harmonic noise, in a mobile phone or the like, with a delay of a predetermined number of frames. Also, since the present embodiment allows distinction between speech and music, it becomes possible, in the communication using a mobile phone or the like, to code a speech part and a music part by different methods.

According to the above-described embodiments, it is possible to determine speech segments accurately, not depending on the fluctuation of the input signal level, even if voice is produced under environmental noise. It is also possible to detect speech segments accurately by removing the influence of a sudden noise or a periodic noise. Furthermore, it is possible to detect speech segments in real time. In addition, it is possible to accurately detect, as speech segments, consonant parts that show unclear harmonic structures. It is also possible to remove spectral envelope components by performing low-cut filtering on the spectral components obtained by frequency-converting an input signal.

The speech segment detection device according to the present invention has been described based on the first through fifth embodiments, but the present invention is not limited to these embodiments.

(Modification of FFT Unit 200)

For example, in the above embodiments, a method using FFT power spectral components as acoustic features has been described, but it is also possible to use the FFT spectral components themselves, a per-frame autocorrelation function and FFT power spectral components of a linear prediction residual in the time domain. Or, it is also possible to accentuate a harmonic structure by widening the difference between the maximum value and the minimum value of the power spectral components, using the method of multiplying each spectral component by itself, before obtaining FFT power spectra from FFT spectra. Furthermore, it is possible to obtain an FFT power spectrum by calculating the square root of an FFT spectrum, instead of obtaining an FFT power spectrum by calculating the logarithm of an FFT spectrum. Also, it is possible to multiply each frame of time domain data by a window coefficient such as a Hamming window before obtaining FFT spectral components, or to accentuate the higher frequency part by performing pre-emphasis processing (1−z⁻¹). Or, it is possible to use line spectral frequencies (LSF) as acoustic features. In addition, the frequency transform operation is not limited to FFT, and discrete Fourier transform (DFT), discrete cosine transform (DCT) or discrete sine transform (DST) may be used.
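The pre-emphasis and windowing steps mentioned here can be sketched as follows. The filter 1−z⁻¹ is implemented as a first-order difference, the sample preceding the frame is assumed to be 0, and the standard Hamming coefficients 0.54 and 0.46 are used; the function names are illustrative only.

```python
import math


def pre_emphasis(x):
    """Apply 1 - z^-1, i.e. y[n] = x[n] - x[n-1], to accentuate high frequencies."""
    # Assumption: the sample before the start of the frame is taken as 0.
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def hamming(n_samples):
    """Hamming window coefficients w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def prepare_frame(x):
    """Pre-emphasize a frame and multiply it by a Hamming window before the FFT."""
    return [s * w for s, w in zip(pre_emphasis(x), hamming(len(x)))]
```

The output of `prepare_frame` would then be passed to whichever frequency transform (FFT, DFT, DCT or DST) is chosen.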

(Modification of Harmonic Structure Extraction Unit 201)

Instead of the processing performed by the harmonic structure extraction unit 201 for removing a floor component included in a spectral component S(f) (S26 in FIG. 3), it is possible to perform low-cut filtering on the spectral component S(f). Considering the spectral component S(f) of each frame as a waveform in the frequency domain, a spectral envelope component fluctuates more slowly than a harmonic structure. Therefore, by performing low-cut filtering on the spectral component, the spectral envelope component can be removed. This method is equivalent to removal of a low frequency component using a low-cut filter in the time domain, but it can be said that the method of filtering in the frequency domain is more desirable in that it is possible to evaluate the harmonic structure and the information such as frequency band power and spectral envelope at the same time. However, the spectral component calculated using such a low-cut filter could include not only a speech sound with frequency fluctuations caused by harmonic structures but also a non-periodic noise and a non-speech sound of a single frequency such as an electronic sound. These sounds, however, can be removed by the processing by the voiced feature evaluation unit 210 and the speech segment determination unit 205.
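One simple way to realize such a low-cut filter along the frequency axis is to subtract a moving average from the spectrum, as in the sketch below. The moving-average design and the `half_width` parameter are choices made for this illustration and are not taken from the specification, which does not name a particular filter.

```python
def low_cut_along_frequency(spectrum, half_width=8):
    """Remove the slowly varying envelope of S(f) by subtracting a
    moving average taken over neighbouring frequency bins."""
    n = len(spectrum)
    filtered = []
    for f in range(n):
        lo = max(0, f - half_width)
        hi = min(n, f + half_width + 1)
        envelope = sum(spectrum[lo:hi]) / (hi - lo)   # local mean = envelope estimate
        filtered.append(spectrum[f] - envelope)
    return filtered
```

A flat spectrum is mapped to zero everywhere, while a rapid ripple such as the peaks of a harmonic structure survives as alternating positive and negative values.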

As another method for removing a floor component, there is a method in which spectral components equal to or smaller than a predetermined reference value are not used. The method for calculating the reference value includes: a method using, as a reference value, the average value of the spectral components of all the frames; a method using, as a reference value, the average value of the spectral components in a time duration which is sufficiently longer than the duration of a single utterance (for example, five seconds); and a method of previously dividing the spectral component into several frequency bands and using, as a reference value, the average value of the spectral components of each frequency band. Particularly in the case where the environment changes, for example, when a quiet environment changes to a noisy one, it is more desirable to use the average value of spectral components in a segment of a few seconds including a current frame to be detected than to use the average value of spectral components of all the frames.
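The reference-value method using a recent segment can be sketched as follows. Here "not using" a component is interpreted as setting it to zero, and `span` (the number of recent frames averaged, corresponding to a few seconds) is a hypothetical parameter; both are assumptions of this sketch.

```python
def remove_floor(frames, i, span):
    """Zero the components of frame i that do not exceed the reference value.

    frames : list of per-frame spectra (lists of spectral components)
    span   : number of most recent frames, including frame i, to average
    """
    window = frames[max(0, i - span + 1): i + 1]
    n_bins = len(frames[i])
    # Reference value = mean of all spectral components in the window.
    reference = sum(sum(frame) for frame in window) / (len(window) * n_bins)
    return [s if s > reference else 0.0 for s in frames[i]]
```

Averaging over a sliding window instead of over all frames lets the reference value follow a change from a quiet to a noisy environment.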

(Modification of Inter-frame Feature Correlation Value Calculation Unit 203)

The inter-frame feature correlation value calculation unit 203 may calculate a correlation value E1(j) using the following equation (24), as a correlation function, instead of the equation (3). Here, the equation (24) indicates the cosine of the angle formed by two vectors P(i-1) and P(i), where P(i-1) and P(i) are vectors in a 128-dimensional vector space. The inter-frame feature correlation value calculation unit 203 may calculate a correlation value E2(j), instead of the correlation value E1(j), according to the following equations (25) and (26), using the inter-frame correlation value between the frame j and a frame four frames away from the frame j, or may calculate a correlation value E3(j) according to the following equations (27) and (28), using the inter-frame correlation value between the frame j and a frame eight frames away from the frame j. As mentioned above, this modification is characterized in that a correlation value which is immune to a sudden environmental noise can be obtained by calculating a correlation value between frames far away from each other.

Furthermore, it is possible to calculate a correlation value E4(j) depending on the sizes of the correlation value E1(j), the correlation value E2(j) and the correlation value E3(j), according to the following equations (29) to (31), or to calculate a correlation value E5(j) that is the result of the addition of the correlation value E1(j), the correlation value E2(j) and the correlation value E3(j), according to the following equation (32), or to calculate a correlation value E6(j) that is the maximum value among the correlation value E1(j), the correlation value E2(j) and the correlation value E3(j), according to the following equation (33).

$$\mathrm{xcorr}\left(P(i-1), P(i)\right) = \frac{P(i-1) \cdot P(i)}{\left\|P(i-1)\right\| \left\|P(i)\right\|} = \frac{p_{1}(i-1)\, p_{1}(i) + p_{2}(i-1)\, p_{2}(i) + \ldots + p_{128}(i-1)\, p_{128}(i)}{\sqrt{p_{1}(i-1)^{2} + p_{2}(i-1)^{2} + \ldots + p_{128}(i-1)^{2}}\ \sqrt{p_{1}(i)^{2} + p_{2}(i)^{2} + \ldots + p_{128}(i)^{2}}} \qquad (24)$$

$$z_{2}(i) = \max\left(\mathrm{xcorr}\left(P(i-4), P(i)\right)\right) \qquad (25)$$

$$E_{2}(j) = \sum_{i=j-2}^{j} z_{2}(i) \qquad (26)$$

$$z_{3}(i) = \max\left(\mathrm{xcorr}\left(P(i-8), P(i)\right)\right) \qquad (27)$$

$$E_{3}(j) = \sum_{i=j-2}^{j} z_{3}(i) \qquad (28)$$

$$E_{4}(j) = z_{1}(j) \qquad (29)$$

$$\text{if } \left(z_{3}(j) > 0.5\right) \quad E_{4}(j) = E_{4}(j) + z_{1}(j)/z_{3}(j) \qquad (30)$$

$$\text{if } \left(z_{2}(j) > 0.5\right) \quad E_{4}(j) = E_{4}(j) + z_{1}(j)/z_{2}(j) \qquad (31)$$

$$E_{5}(j) = E_{1}(j) + E_{2}(j) + E_{3}(j) = \sum_{i=j-2}^{j} z_{1}(i) + \sum_{i=j-2}^{j} z_{2}(i) + \sum_{i=j-2}^{j} z_{3}(i) \qquad (32)$$

$$E_{6}(j) = \max\left(E_{1}(j), E_{2}(j), E_{3}(j)\right) = \max\left(\sum_{i=j-2}^{j} z_{1}(i),\ \sum_{i=j-2}^{j} z_{2}(i),\ \sum_{i=j-2}^{j} z_{3}(i)\right) \qquad (33)$$
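As an illustration of equations (24) to (33), the following is a minimal sketch, not the implementation of the embodiment. It assumes that z1, z2 and z3 are the cosine of equation (24) taken at frame distances of 1, 4 and 8, so that the max() in equations (25) and (27) acts on a single value, and that frames before the start of the signal contribute 0. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Equation (24): cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero frame has no correlation
    return dot / (norm_a * norm_b)


def z(P, i, lag):
    """Correlation between frame i and the frame `lag` frames earlier."""
    if i - lag < 0:
        return 0.0  # assumption: missing history contributes nothing
    return cosine(P[i - lag], P[i])


def E(P, j, lag):
    """Equations (26) and (28): sum of z over frames j-2 to j."""
    return sum(z(P, i, lag) for i in range(max(0, j - 2), j + 1))


def E4(P, j):
    """Equations (29) to (31)."""
    z1, z2, z3 = z(P, j, 1), z(P, j, 4), z(P, j, 8)
    value = z1
    if z3 > 0.5:
        value += z1 / z3
    if z2 > 0.5:
        value += z1 / z2
    return value


def E5(P, j):
    """Equation (32): sum of E1, E2 and E3."""
    return E(P, j, 1) + E(P, j, 4) + E(P, j, 8)


def E6(P, j):
    """Equation (33): maximum of E1, E2 and E3."""
    return max(E(P, j, 1), E(P, j, 4), E(P, j, 8))


# Twelve identical frames: every inter-frame cosine is (numerically) 1.
P = [[1.0, 2.0, 3.0] for _ in range(12)]
e1, e2, e3 = E(P, 11, 1), E(P, 11, 4), E(P, 11, 8)
```

For a perfectly stationary harmonic structure, as in the toy input, E1, E2 and E3 coincide. They differ when the structure changes between nearby frames but persists between frames further apart.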

Note that the correlation values are not limited to the above six values E1(j) to E6(j), and a new correlation value may be calculated by combining these correlation values. For example, it is also possible to use, based on a previously estimated SNR of the input acoustic signal, the correlation value E1(j) when the SNR is low, and the correlation value E2(j) or E3(j) when the SNR is high.
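The SNR-dependent choice described above could be sketched as follows. The threshold value, the `prefer_far` flag that picks between E2(j) and E3(j), and the function name are hypothetical, since the text does not specify them.

```python
def select_correlation(e1, e2, e3, snr_db, snr_threshold_db=10.0,
                       prefer_far=False):
    """Pick one correlation value based on a previously estimated SNR.

    Assumptions: `snr_threshold_db` separates "low" from "high" SNR, and
    `prefer_far` chooses E3 over E2 in the high-SNR case.
    """
    if snr_db < snr_threshold_db:
        return e1
    return e3 if prefer_far else e2


chosen_low = select_correlation(0.9, 0.7, 0.5, snr_db=3.0)
chosen_high = select_correlation(0.9, 0.7, 0.5, snr_db=20.0)
```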

(Modification of Speech Segment Determination Unit 205)

The processing of the speech segment determination unit 205 which has been described with reference to FIG. 6 is roughly classified into the following three processes: the process for determining a voiced segment using a correlation value (S42 to S50); the process for concatenating voiced segments (S52 to S58); and the process for determining a speech segment based on the duration of the voiced segment (S60 to S68). However, these three processes do not need to be executed in the order shown in FIG. 6, and they may be executed in another order. Only one or two of these three processes may be executed. FIG. 6 shows the example where the processing is performed on a single utterance basis, but a speech segment may be determined and corrected per frame, for example, by performing only the process for determining the voiced segment using the correlation value for each current frame. It is also possible, assuming that real-time detection is required, to output the speech segment determined using the correlation value per frame, as a preliminary value, and separately output, on a regular basis, the speech segment corrected and determined on a longer segment basis such as a single utterance basis, as a determined value, so that the present invention is implemented as a speech detector which can meet both the requirement for real-time detection and the requirement for high segment detection performance.
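The three processes can be pictured with the following simplified sketch. It does not reproduce steps S42 to S68 of FIG. 6. The thresholds `threshold`, `max_gap` and `min_len` and all function names are hypothetical, and segments are represented as inclusive (start, end) frame indices.

```python
def voiced_segments(values, threshold):
    """Process 1: runs of frames whose correlation value exceeds threshold."""
    segments = []
    start = None
    for i, v in enumerate(values):
        if v > threshold:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(values) - 1))
    return segments


def concatenate(segments, max_gap):
    """Process 2: merge voiced segments separated by at most max_gap frames."""
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def keep_long(segments, min_len):
    """Process 3: keep only segments lasting at least min_len frames."""
    return [(s, e) for s, e in segments if e - s + 1 >= min_len]


values = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0]
preliminary = voiced_segments(values, threshold=0.5)  # per-frame decision
determined = keep_long(concatenate(preliminary, max_gap=1), min_len=3)
```

Because each process is a separate function, they can be applied in another order or only in part, and the output of the first process can serve as the preliminary value while the full chain yields the determined value.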

(Modification of SNR Estimation Unit 206)

The SNR estimation unit 206 may estimate SNR directly from an input signal. For example, the SNR estimation unit 206 obtains, from the corrected correlation values calculated by the difference processing unit 204, the power of the S (signal) part including positive corrected correlation values and the power of the N (noise) part including negative corrected correlation values, so as to obtain the SNR.
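One possible reading of this estimate is sketched below. It assumes that "power" means the sum of squared corrected correlation values in each part and that the SNR is expressed in decibels. Neither choice is stated in the text, and the function name is hypothetical.

```python
import math


def estimate_snr_db(corrected):
    """Estimate SNR from corrected correlation values.

    Assumptions: positive values form the S (signal) part, negative values
    form the N (noise) part, and the power of each part is its sum of squares.
    """
    signal_power = sum(v * v for v in corrected if v > 0)
    noise_power = sum(v * v for v in corrected if v < 0)
    if noise_power == 0.0:
        return math.inf
    if signal_power == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)


snr = estimate_snr_db([2.0, -1.0, 2.0, -1.0])  # power ratio 8 / 2 = 4
```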

(Other Modifications)

Furthermore, it is possible to use the speech segment detection device as a speech recognition device for speech recognition of only speech segments after the above speech segment detection processing is performed as preprocessing.

It is also possible to use the speech segment detection device as a speech recording device such as an integrated circuit (IC) recorder for recording only speech segments after the above speech segment detection processing is performed as preprocessing. As described above, by recording only the speech segments, it becomes possible to use a storage area of the IC recorder efficiently. It also becomes possible to extract only the speech segments for efficient reproduction thereof using a speech rate conversion function.

It is also possible to use the speech recognition device as a noise reduction device which removes other parts than speech segments of an input signal so as to suppress noise.

It is further possible to use the above speech segment detection processing for extracting a video part of speech segments from the video shot by a video tape recorder (VTR) or the like, and this processing is applicable to an authoring tool or the like for editing video.

It is also possible to extract one or more frequency bands, among the power spectral components S′(f) shown in FIG. 4(f), in which harmonic structures are maintained in the best manner, and perform the processing using only these extracted bands.

It is also possible to learn noise features in non-speech segments by detecting such segments so as to determine filtering coefficients for noise removal, parameters for noise determination and the like. By doing so, a device for removing noise can be created.

In addition, combinations of various harmonic structure values or correlation values and various speech segment determination methods are not limited to the above-mentioned embodiments.

INDUSTRIAL APPLICABILITY

Since the speech segment detection device according to the present invention allows accurate distinction between speech segments and noise segments, it is useful as a preprocessing device for a speech recognition device, an IC recorder which records only speech segments, a communication device which codes speech segments and music segments by different coding methods, and the like.

1-21. (canceled)
22. A harmonic structure acoustic signal detection method of detecting a segment that includes speech, as a speech segment, from an input acoustic signal, said method comprising: an acoustic feature extraction step of extracting an acoustic feature in each of frames into which the input acoustic signal is divided at every predetermined time period; and a segment determination step of evaluating continuity of the acoustic features and of determining a speech segment according to the evaluated continuity, wherein in said acoustic feature extraction step, frequency transform is performed on each of the frames into which the input acoustic signal is divided at every predetermined time period, and the acoustic feature that is a value of a harmonic structure represented by a number is extracted, and in said segment determination step, the speech segment is determined based on one of the following: a correlation value between acoustic features in the same frame; and a correlation value between acoustic features in different frames.
23. The harmonic structure acoustic signal detection method according to claim 22, wherein in said acoustic feature extraction step, a harmonic structure is further accentuated based on each component obtained through the frequency transform, and the acoustic feature is extracted.
24. The harmonic structure acoustic signal detection method according to claim 23, wherein in said acoustic feature extraction step, a harmonic structure is further extracted from each component obtained through the frequency transform, and a component which is obtained through the frequency transform and has a predetermined frequency band that includes the harmonic structure is judged to be the acoustic feature.
25. The harmonic structure acoustic signal detection method according to claim 22, wherein in said acoustic feature extraction step, each component obtained through the frequency transform of each frame is further divided into frequency bands of a predetermined bandwidth, a correlation value is calculated between the components that have predetermined frequency bands in the same frame, and the acoustic feature is extracted based on the calculated correlation value.
26. The harmonic structure acoustic signal detection method according to claim 25, wherein in said acoustic feature extraction step, a difference is further calculated between a maximum value and a minimum value of the correlation values in each frame, and the acoustic feature is extracted based on the difference.
27. The harmonic structure acoustic signal detection method according to claim 22, wherein in said acoustic feature extraction step, each component obtained through the frequency transform of each frame is further divided into frequency bands of a predetermined bandwidth, a correlation value is calculated between the components that have predetermined frequency bands in different frames, and the acoustic feature is extracted based on the calculated correlation value.
28. The harmonic structure acoustic signal detection method according to claim 27, wherein in said acoustic feature extraction step, a difference is further calculated between a maximum value and a minimum value of the correlation values in each frame, and the acoustic feature is extracted based on the difference.
29. The harmonic structure acoustic signal detection method according to claim 22, wherein in said segment determination step, continuity of the acoustic features is evaluated based on a correlation value between the acoustic features of different frames, and the speech segment is determined according to the evaluated continuity.
30. The harmonic structure acoustic signal detection method according to claim 22, wherein in said segment determination step, continuity of the acoustic features is evaluated based on distributions of the acoustic features in different frames, and the speech segment is determined according to the evaluated continuity.
31. The harmonic structure acoustic signal detection method according to claim 22, comprising: an evaluation step of calculating an evaluation value for evaluating the continuity of the acoustic features; and a speech segment determination step of evaluating temporal continuity of the evaluation values and of determining a speech segment according to the evaluated temporal continuity.
32. The harmonic structure acoustic signal detection method according to claim 31, wherein said segment determination step further includes: a step of estimating a speech signal-to-noise ratio of the input acoustic signal based on comparisons, for a predetermined number of frames, between (i) acoustic features extracted in said acoustic feature extraction step or the evaluation values calculated in said evaluation step and (ii) a first predetermined threshold; and a step of determining the speech segment based on the evaluation value calculated in said evaluation step, in the case where the estimated speech signal-to-noise ratio is equal to or higher than a second predetermined threshold, and in said speech segment determination step, the temporal continuity of the evaluation values is evaluated and the speech segment is determined according to the evaluated temporal continuity, in the case where the speech signal-to-noise ratio is lower than the second predetermined threshold.
33. The harmonic structure acoustic signal detection method according to claim 22, wherein said segment determination step includes: an evaluation step of calculating an evaluation value for evaluating the continuity of the acoustic features; and a non-speech harmonic structure segment determination step of evaluating temporal continuity of the evaluation values and determining, according to the evaluated temporal continuity, a non-speech harmonic structure segment that has a harmonic structure but is not a speech segment.
34. The harmonic structure acoustic signal detection method according to claim 33, wherein said acoustic feature extraction step includes: a frequency transform step of performing frequency transform on each of the frames into which the input acoustic signal is divided at every predetermined time period; a correlation value calculation step of dividing a component obtained through the frequency transform of each frame into frequency bands of a predetermined bandwidth, and of calculating a correlation value between the components that have predetermined frequency bands in the same frame; and an extraction step of extracting, as the acoustic feature, an identifier of a frequency band in which the component has a maximum value or a minimum value of the correlation values in the same frame.
35. The harmonic structure acoustic detection method according to claim 22, wherein said acoustic feature extraction step includes: a frequency transform step of performing frequency transform on each of the frames into which the input acoustic signal is divided at every predetermined time period; a correlation value calculation step of calculating a correlation value between components obtained through the frequency transform of frames which are a predetermined number of frames away from each other; and an acoustic feature extraction step of extracting the acoustic feature that is a value of a harmonic structure represented by a number, by calculating a distribution of the correlation values in every predetermined number of frames.
36. The harmonic structure acoustic signal detection method according to claim 22, wherein in said segment determination step, the continuity is evaluated based on correlation values between two or more types of frames of different time periods.
37. The harmonic structure acoustic signal detection method according to claim 36, wherein in said segment determination step, one of the correlation values between the two or more types of frames of different time periods is selected based on a speech signal-to-noise ratio of the input acoustic signal, and the continuity is evaluated based on the selected correlation value.
38. The harmonic structure acoustic signal detection method according to claim 22, wherein in said segment determination step, the continuity is evaluated based on a corrected correlation value calculated using a difference between (i) a correlation value between the acoustic features of frames and (ii) an average value of the correlation values of a predetermined number of frames.
39. A harmonic structure acoustic signal detection device which detects a segment that includes speech, as a speech segment, from an input acoustic signal, said device comprising: an acoustic feature extraction unit operable to extract an acoustic feature in each of frames into which the input acoustic signal is divided at every predetermined time period; and a segment determination unit operable to evaluate continuity of the acoustic features, and to determine a speech segment according to the evaluated continuity, wherein said acoustic feature extraction unit is operable to perform frequency transform on each of the frames into which the input acoustic signal is divided at every predetermined time period, and to extract the acoustic feature that is a value of a harmonic structure represented by a number, and said segment determination unit is operable to determine the speech segment based on one of the following: a correlation value between acoustic features in the same frame; and a correlation value between acoustic features in different frames.
40. A speech recognition device which recognizes speech included in an input acoustic signal, said device comprising: an acoustic feature extraction unit operable to extract an acoustic feature in each of frames into which the input acoustic signal is divided at every predetermined time period; a segment determination unit operable to evaluate continuity of the acoustic features, and to determine a speech segment according to the evaluated continuity; and a recognition unit operable to recognize speech in the speech segment determined by said segment determination unit, wherein said acoustic feature extraction unit is operable to perform frequency transform on each of the frames into which the input acoustic signal is divided at every predetermined time period, and to extract the acoustic feature that is a value of a harmonic structure represented by a number, and said segment determination unit is operable to determine the speech segment based on one of the following: a correlation value between acoustic features in the same frame; and a correlation value between acoustic features in different frames.
41. A speech recording device which records speech included in an input acoustic signal, said device comprising: an acoustic feature extraction unit operable to extract an acoustic feature in each of frames into which the input acoustic signal is divided at every predetermined time period; a segment determination unit operable to evaluate continuity of the acoustic features, and to determine a speech segment according to the evaluated continuity; and a recording unit operable to record the input acoustic signal in the speech segment determined by said segment determination unit, wherein said acoustic feature extraction unit is operable to perform frequency transform on each of the frames into which the input acoustic signal is divided at every predetermined time period, and to extract the acoustic feature that is a value of a harmonic structure represented by a number, and said segment determination unit is operable to determine the speech segment based on one of the following: a correlation value between acoustic features in the same frame; and a correlation value between acoustic features in different frames.
42. A program which causes a computer to execute: an acoustic feature extraction step of extracting an acoustic feature in each of frames into which the input acoustic signal is divided at every predetermined time period; and a segment determination step of evaluating continuity of the acoustic features and of determining a speech segment according to the evaluated continuity, wherein in said acoustic feature extraction step, frequency transform is performed on each of the frames into which the input acoustic signal is divided at every predetermined time period, and the acoustic feature that is a value of a harmonic structure represented by a number is extracted, and in said segment determination step, the speech segment is determined based on one of the following: a correlation value between acoustic features in the same frame; and a correlation value between acoustic features in different frames.